Abstract. Simulated and parallel tempering are families of Markov Chain Monte Carlo
algorithms where a temperature parameter is varied during the simulation to overcome
bottlenecks to convergence due to multimodality.

In this work we introduce and analyze the convergence of a set of new tempering distributions
which we call entropy dampening. For asymmetric exponential distributions and the mean-field
Ising model with an external field, simulated tempering is known to converge slowly. We show
that tempering with entropy dampening distributions mixes in polynomial time for these models.

Examining slow mixing times of tempering more closely, we show that for the mean-field 
3-state ferromagnetic Potts model, tempering converges slowly regardless of the temperature 
schedule chosen. On the other hand, tempering with entropy dampening distributions converges 
in polynomial time to stationarity. Finally we show that the slow mixing can be very expensive 
practically. In particular, the mixing time of simulated tempering is an exponential factor longer 
than the mixing time at the fixed temperature. 


1. Introduction 

Markov Chain Monte Carlo (MCMC) methods for sampling from a target distribution have
become ubiquitous in Bayesian statistics [?], fields such as machine learning [2], and in
simulations of large physical systems [34]. MCMC has also played an important role in several
central results in the theory of algorithms [?].

It is usually straightforward to design an MCMC algorithm which converges to the desired
target distribution. Unfortunately, a common difficulty in applications from statistics and
statistical physics is multimodality in the distribution, which can cause the algorithm to take
an impractical amount of time to converge.

Simulated tempering [?] and Metropolis-coupled MCMC (or parallel tempering, or
swapping) [?] are Markov chain samplers related to simulated annealing which are widely
used for sampling in the presence of multimodality. Their popularity in practice makes it
important to understand their convergence rates theoretically. In these algorithms, a temperature
parameter is randomly updated over a range of values during the simulation. The idea is to speed
up sampling at low temperatures, circumventing the bottlenecks of the multimodal distribution
by sampling some of the time at higher temperatures. In the following discussion, by “converges
quickly” or “fast convergence” we mean that the Markov chain converges to within
a small distance of the stationary distribution in time that is polynomial in the size
of the states. “Slow convergence” means that even after an exponential amount of time in this
parameter, the chain is far from the stationary distribution.

Obstructions to the fast convergence of the dynamics often arise in models from statistical
physics which exhibit phase transitions. The Ising model and its generalization, the Potts model,
are models from statistical physics of large numbers of interacting particles where the
equilibrium distribution exhibits multimodality. Due to this, local MCMC algorithms for simulating
such systems converge slowly. Madras and Zheng [27] analyzed simulated tempering for two
symmetric bimodal distributions: the mean-field Ising model and an “exponential valley”
distribution. In both cases they showed that tempering and swapping converge quickly at any
temperature. Their analysis makes use of the decomposition theorems of [25], which say that it is
sufficient to bound the convergence time of the chain within each mode and of the macro-chain
over all the temperatures by a polynomial.

In contrast, in [4], we showed that for the 3-state mean-field ferromagnetic Potts model (which
is not bimodal, but rather has three modes at low temperature), simulated tempering converges
prohibitively slowly. This is caused by a phase transition in the Potts model which is of a
different type than the phase transition in the Ising model. In fact, due to the nature of the phase
transition, simulated tempering converges slowly regardless of the intermediate temperatures
chosen for tempering. The proof of this theorem appears in unpublished form in [?] and we
include the full proof here.

Woodard, Huber and Schmidler [?] generalized the above examples to give frameworks
for polynomial and slow convergence of simulated and parallel tempering for more general
measures and state spaces. In particular, in [37] they give sufficient conditions on a distribution
and a sequence of temperatures for simulated and parallel tempering to converge polynomially.
In [?] they show several cases where, if the above conditions are violated, simulated and parallel
tempering will converge slowly. In particular, one property that plays an important role is what
they term the “persistence” of a distribution. Roughly, they show that if there is a single mode
of the distribution which is very narrow or “spiky” compared to the other modes but has about
the same probability mass, then tempering and swapping converge slowly. Woodard et al. use
this property to explain the slow convergence for the 3-state Potts model as well.

The main results of the current paper touch upon these last points. In [4], we also extended
the results of Madras and Zheng showing polynomial convergence of swapping for symmetric
exponential distributions to the case of an asymmetric exponential distribution. This more
general result leads to the insight that it is possible to choose the distributions for tempering
more advantageously if we do not restrict to distributions that are parameterized by temperature.
As an application, in [4] we cited polynomial convergence of swapping (and hence simulated
tempering) for the mean-field Ising model with an external field. In this paper we present the
full proof of this result. We define certain “entropy dampening distributions” which make use of
properties of the stationary measure. We show that if entropy dampening distributions are used
for the 3-state mean-field Potts model, then in fact tempering mixes polynomially. These examples
of polynomial mixing do not fall under the sufficient conditions given in [37] since we make use of
more general tempered densities.

Lastly, we show that there are cases when the mixing time of the tempering algorithm can be
significantly slower than that of the fixed temperature Metropolis algorithm; i.e., even if we use
a polynomial number of distributions, the mixing time of the tempering chain may be
exponentially larger than that of the chain at the fixed low temperature. This contradicts the
conventional wisdom that tempering can be in the worst case slower by a factor that is polynomial
in the number of temperatures. Our proof makes use of sharper results about the slow mixing
beyond the conditions in [4], [?].


¹Some of these results appear in an extended abstract [4] and in thesis form [?]. This is the
full version, which contains complete proofs of all the results.

We point out that the examples considered here are tractable by other means and one does not
need a Markov chain to sample configurations. Nevertheless, we feel the methods presented here
offer some insight into how to design more robust tempering algorithms in general.

The remainder of this paper is organized as follows. In Section 2 we present some
preliminaries on spin systems and Markov chains. In Section 3 we define the simulated tempering
and swapping algorithms formally. The statements of the main theorems can be found in Section 4.
In Section 5 we analyze the convergence time of the swapping algorithm for asymmetric
exponential distributions. In Section 6 we show that the swapping algorithm using a modified
entropy dampening distribution mixes polynomially for sampling from the mean-field Ising model
with any external field. In Section 7 we show that there is a temperature at which simulated
tempering mixes exponentially slowly for the 3-state mean-field ferromagnetic Potts model.
Finally, in Section 8 we show that in certain cases tempering can slow down the convergence of
the Metropolis algorithm by an exponential factor.

2. Spin Systems and Markov Chains 

The $q$-state Potts model [32] on a finite graph $G = (V, E)$ at inverse temperature $\beta > 0$
with an external magnetic field is defined as follows. The set of possible configurations is
$\Omega = \{1, \ldots, q\}^V$, where a configuration $x$ is an assignment of one of $q$ spins to
each vertex of $G$ and $x_i$ denotes the spin of $i \in V$. The case $q = 2$ corresponds to the
classical special case of the Ising model. In this case, the set of spins is conventionally taken
to be $\{+1, -1\}$ and we will follow this notation. Spins may also be referred to as colors and
we use the two interchangeably. For a spin configuration $x$, let
$\sigma(x) = \sigma = (\sigma_1, \ldots, \sigma_q)$ where $\sigma_i$ denotes the number of vertices
in $x$ with spin $i$ for $1 \le i \le q$. The Hamiltonian of a configuration $x$ is defined by

$$H(x) = \sum_{(i,j) \in E} \delta(x_i, x_j) + \sum_{m=1}^{q} \sum_{i \in V} h_m \, \delta(x_i, m),$$

where $\delta$ is the Kronecker delta function and $h = (h_1, \ldots, h_q)$ are real numbers
representing the external fields. The probability that the system has a given configuration $x$
at inverse temperature $\beta$ is given by the Gibbs distribution:

$$\pi_\beta(x) = \frac{e^{\beta H(x)}}{Z(\beta, h)},$$

where $Z(\beta, h)$ is a normalizing constant known as the partition function. In the case that
the external fields are 0, we denote the partition function by $Z(\beta)$. The higher the inverse
temperature $\beta$, the more the distribution favors configurations which have many neighboring
vertices with the same spin. At $\beta = 0$, i.e. infinite temperature, the Gibbs distribution is
uniform over all configurations. Here we are concerned with $\beta > 0$, which is the
ferromagnetic Potts model. In contrast, in the anti-ferromagnetic case $\beta < 0$, neighbors in
the underlying graph prefer to have different spins.
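As a concrete illustration of the Gibbs distribution, the following is a minimal brute-force sketch for very small systems. It assumes the Hamiltonian counts monochromatic edges plus a per-vertex field term, labels spins 0..q-1 rather than 1..q, and uses the hypothetical helper names `hamiltonian` and `gibbs_distribution`; none of these choices come from the paper itself.

```python
import itertools
import math

def hamiltonian(x, edges, h):
    """Assumed form: number of monochromatic edges plus the field term h[spin] per vertex."""
    pair_term = sum(1 for (i, j) in edges if x[i] == x[j])
    field_term = sum(h[s] for s in x)
    return pair_term + field_term

def gibbs_distribution(n, q, edges, beta, h):
    """Brute-force pi_beta(x) = exp(beta * H(x)) / Z over all q**n configurations."""
    states = list(itertools.product(range(q), repeat=n))
    weights = [math.exp(beta * hamiltonian(x, edges, h)) for x in states]
    Z = sum(weights)  # partition function Z(beta, h)
    return {x: w / Z for x, w in zip(states, weights)}

# Complete graph on 4 vertices (the mean-field case), 3 spins, no external field.
n, q = 4, 3
edges = list(itertools.combinations(range(n), 2))
pi_hot = gibbs_distribution(n, q, edges, beta=0.0, h=[0.0] * q)   # uniform
pi_cold = gibbs_distribution(n, q, edges, beta=1.0, h=[0.0] * q)  # favors aligned spins
```

The enumeration is exponential in n, so this is only useful for checking small cases by hand; it is not a sampling method.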

2.1. Markov Chains. In the MCMC method, the Markov chain performs a random walk on
the Markov kernel, which is a graph defined on the space of configurations. One such Markov
chain for sampling from a Gibbs distribution is the heat bath Glauber dynamics. Starting at a
state $x_0 \in \Omega$, at each time step a vertex is chosen at random from $V$ and its spin is
updated by choosing it according to $\pi_\beta$ conditioned on the spins of the other vertices.
Thus the kernel for this chain is the graph on $\Omega$ where there is an edge between two
configurations if they differ by the spin of one vertex. It can be checked that the chain is
reversible with respect to $\pi_\beta$ and ergodic, and thus $\pi_\beta$ is the stationary
distribution.

In general, given a connected kernel, it is straightforward to sample from a desired distribution
$\pi$ on $\Omega$ using the Metropolis-Hastings algorithm [?]. Suppose that $Q$ is the transition
matrix of an irreducible, symmetric Markov chain over the state space $\Omega$; this will be the
proposal chain. The transition matrix $P$ of the Metropolis Markov chain is given by

$$P(x,y) = \begin{cases} Q(x,y) \min\left(1, \dfrac{\pi(y)}{\pi(x)}\right) & \text{if } y \ne x, \\ 1 - \sum_{z \ne x} P(x,z) & \text{if } y = x. \end{cases}$$

The chain $P$ is irreducible and reversible with respect to $\pi$ and therefore $\pi$ is the
stationary distribution of $P$ (see e.g. [?]).
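The Metropolis construction can be sketched for a small explicit state space as follows. This is a minimal illustration, assuming states are indexed 0..n-1, `Q` is a symmetric stochastic matrix given as nested lists, and `pi` is strictly positive; the function name `metropolis_matrix` is hypothetical.

```python
def metropolis_matrix(Q, pi):
    """Metropolis chain for target pi built from a symmetric proposal matrix Q."""
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y != x:
                # Propose y with probability Q[x][y], accept with min(1, pi(y)/pi(x)).
                P[x][y] = Q[x][y] * min(1.0, pi[y] / pi[x])
        # Rejected proposals (and proposals of x itself) keep the chain at x.
        P[x][x] = 1.0 - sum(P[x])
    return P

# Proposal: simple random walk on a path with 4 states, holding at the two ends.
Q = [[0.5, 0.5, 0.0, 0.0],
     [0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5],
     [0.0, 0.0, 0.5, 0.5]]
pi = [0.1, 0.2, 0.3, 0.4]
P = metropolis_matrix(Q, pi)
```

Reversibility can be checked directly: `pi[x] * P[x][y]` equals `pi[y] * P[y][x]` for every pair of states.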

The convergence of a Markov chain can be measured by the mixing time, the time for the chain
to come within a small distance of the equilibrium distribution. Let $(X_t)$ be an ergodic (i.e.,
irreducible and aperiodic), reversible Markov chain with finite state space $\Omega$, transition
probability matrix $P$, and stationary distribution $\pi$. Let $P^t(x,y)$ denote the $t$-step
transition probability from $x$ to $y$.

Definition 2.1. The total variation distance at time $t$ of $(X_t)$ to stationarity is

$$\|P^t, \pi\|_{tv} = \max_{x \in \Omega} \frac{1}{2} \sum_{y \in \Omega} |P^t(x,y) - \pi(y)|.$$

Definition 2.2. Let $0 < \varepsilon < 1$, then the mixing time $\tau(\varepsilon)$ is defined to be

$$\tau(\varepsilon) := \min\{t : \|P^{t'}, \pi\|_{tv} \le \varepsilon, \ \forall t' \ge t\}.$$
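For a chain small enough to store as a matrix, the two definitions can be evaluated directly. The sketch below is illustrative only: it assumes the transition matrix is given as nested lists, uses plain matrix powers, and relies on the standard fact that the worst-case total variation distance is non-increasing in t, so the first time it drops to eps is the mixing time. The function names are hypothetical.

```python
def mat_mul(A, B):
    """Product of two square matrices stored as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def tv_to_stationarity(Pt, pi):
    """max over starting states x of (1/2) * sum_y |P^t(x, y) - pi(y)|."""
    return max(0.5 * sum(abs(p - q) for p, q in zip(row, pi)) for row in Pt)

def mixing_time(P, pi, eps, t_max=10_000):
    """Smallest t whose worst-case TV distance is at most eps."""
    n = len(P)
    Pt = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # P^0
    for t in range(t_max + 1):
        if tv_to_stationarity(Pt, pi) <= eps:
            return t
        Pt = mat_mul(Pt, P)
    raise RuntimeError("did not mix within t_max steps")

# Two-state chain that flips with probability 1/4; its stationary law is uniform.
P = [[0.75, 0.25], [0.25, 0.75]]
pi = [0.5, 0.5]
tau = mixing_time(P, pi, eps=0.01)
```

For this two-state example the distance after t steps is exactly (1/2)^(t+1), which makes the result easy to check by hand.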

We say that the Markov chain $(X_t)$ mixes polynomially if the mixing time is bounded above
by a polynomial in $n$ and $\log(1/\varepsilon)$, where $n$ is the number of coordinates of each
configuration in the state space. When the mixing time is exponential in $n$, we say the chain
mixes torpidly or slowly or exponentially slowly.

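There are several methods to obtain a bound on the mixing time of a Markov chain. The
inverse of the spectral gap of the transition matrix of a Markov chain characterizes the mixing
time as follows. Let $\lambda_0, \lambda_1, \ldots, \lambda_{|\Omega|-1}$ be the eigenvalues of an
ergodic reversible Markov chain with transition matrix $P$, so that
$1 = \lambda_0 > |\lambda_1| \ge |\lambda_i|$ for all $i \ge 2$. Let the spectral gap be
$Gap(P) := \lambda_0 - |\lambda_1|$.

Theorem 2.3 ([22]). For any $\varepsilon > 0$,

$$\left(\frac{1}{Gap(P)} - 1\right) \ln\left(\frac{1}{2\varepsilon}\right) \le \tau(\varepsilon) \le \frac{1}{Gap(P)} \ln\left(\frac{1}{\pi_* \varepsilon}\right),$$

where $\pi_* = \min_x \pi(x)$.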

2.2. Mean-field Models. The Curie-Weiss or mean-field Potts model corresponds to the case
when the graph $G$ is the complete graph on $n$ vertices. Mean-field models are studied (see e.g.
[2] and references therein) because often in higher dimensions they share characteristics of the
model on lattices.

For the mean-field Potts model at low enough temperatures, local dynamics such as Glauber
dynamics mix exponentially slowly [?, ?, 30, ?, 33]. This is because at low temperature, the
distribution is multimodal, consisting of ordered modes corresponding to configurations which
are predominantly of one spin. These modes are separated by configurations which are
exponentially unlikely in the Gibbs distribution. As the temperature is raised, there is a critical
temperature beyond which a single mode of disordered configurations dominates, since the
contribution of the entropy of configurations dominates the energy, or Hamiltonian, term. For more
details on the mixing time of Glauber dynamics for mean-field models, see [?], [9].

The Swendsen-Wang (SW) algorithm [35] is another algorithm proposed as an alternative
to local dynamics for sampling from configurations of the $q$-state Potts model. Cooper, Dyer,
Frieze and Rue [?] considered the mean-field Ising model and showed that the SW algorithm
mixes polynomially at all temperatures except possibly near the critical point. Gore and Jerrum
[?] showed that the SW algorithm mixes torpidly on the mean-field Potts model for $q \ge 3$ at the
critical temperature. Long, Nachmias and Peres [?] have resolved the order of the mixing time
of SW at the critical point for the Ising model. Recently, Galanis, Štefankovič and Vigoda
have studied the mixing time of the Swendsen-Wang algorithm for the mean-field Potts model
when $q \ge 3$ and shown four different regimes depending on the inverse temperature [?].

3. Simulated Tempering and Swapping 

Simulated and parallel tempering are families of Markov chain algorithms that have been
proposed for sampling from multimodal distributions. They are used widely in practice and
their convergence behavior for mean-field models has led to a better understanding of when these
algorithms can speed up mixing of local Markov chains [27], [39], [4], [?], [?]. The simulated and
parallel tempering Markov chains are built on top of a fixed temperature Metropolis-Hastings
Markov chain. We define these chains in the context of sampling Gibbs distributions below,
although it will be clear from the definitions that the chains may be defined for more general
distributions, each on the space $\Omega$.

3.1. Simulated tempering. Suppose that we wish to sample from a Gibbs distribution $\pi_\beta$
over $\Omega$ at inverse temperature $\beta$. The simulated tempering Markov chain is defined as
follows [?]. Fix $0 = \beta_0 < \cdots < \beta_M = \beta$, a sequence of inverse temperatures. The
state space of the simulated tempering chain is given by

$$\Omega_{st} = \Omega \times \{0, \ldots, M\}.$$

Define the tempering distribution $\pi_i$ as the Gibbs distribution at $\beta_i$,

$$\pi_i := \pi_{\beta_i}, \quad 0 \le i \le M.$$

Denote by $M_i$ a Metropolis-Hastings chain for sampling from $\pi_i$ where at each time,
independently, the proposal chain chooses a vertex in $V$, where $|V| = n$, and a spin in
$\{1, \ldots, q\}$ independently and uniformly at random.

The tempering Markov chain consists of two types of transitions: level moves, which update
the configuration while keeping the temperature fixed, and temperature moves, which update the
temperature while remaining at the same configuration. In each step of the chain, we randomly
choose with equal probability one of the two transitions to perform (cf. [27] for other ways to
define the chain).

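• A level move connects $(x, i)$ and $(x', i)$ with the transition probability given by

$$P_{st}((x,i),(x',i)) := \tfrac{1}{2} M_i(x,x').$$

• A temperature move connects $(x,i)$ to $(x, i \pm 1)$. If the current state is $(x,i)$, choose an
inverse temperature index $j = i \pm 1$ with probability $r_{i,j}$, where
$r_{0,1} = r_{M,M-1} = 1/2$ and $r_{i,j} = 1/2$ for $0 < i < M$. The move to $(x,j)$ is accepted
with the appropriate Metropolis probability. Thus, the transition probabilities are given by

$$P_{st}((x,i),(x,j)) = \begin{cases} \tfrac{1}{2}\, r_{i,j} \min\left(1, \dfrac{\pi_j(x)}{\pi_i(x)}\right) & |j - i| = 1, \\ 0 & |i - j| > 1, \\ \tfrac{1}{2} - \sum_{k = i \pm 1} P_{st}((x,i),(x,k)) & j = i, \end{cases}$$

where the case $j = i$ is the holding probability contributed by temperature moves, i.e. by
proposals that are rejected or that fall outside $\{0, \ldots, M\}$.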

It is straightforward to verify that the chain is reversible with respect to $\pi_{st}$. The
transition probabilities ensure that the stationary distribution $\pi_{st}$ is uniform over all
temperatures and the conditional distributions $\pi_{st}(\cdot, i)$ for $0 \le i \le M$ are
proportional to the fixed temperature Gibbs distributions $\pi_i$. That is,

$$\pi_{st}(x, i) = \frac{1}{M+1}\, \pi_i(x), \quad (x, i) \in \Omega_{st}.$$


If $M$ is chosen to be a polynomial in $n$, the stationary weight of the set of states at each
fixed inverse temperature $\beta_i$ is at least an inverse polynomial fraction of the state space.
A common choice of inverse temperatures is to take $\beta_i = i\beta/M$. It can be verified that
in this case, if $M$ is at least polynomial in $n$, the transition probabilities are
non-negligible, by bounding the size of the ratio of partition functions
$Z(\beta_{i \pm 1})/Z(\beta_i)$. Notice that while the exponential factor is simple to calculate
given $x$ and $i$, it is not clear that we can compute the ratio of partition functions in order
to implement the simulated tempering algorithm. The swapping algorithm is designed to avoid this
difficulty in implementing temperature moves.


3.2. Swapping. The swapping algorithm was defined by Geyer [?]. Let $\pi_\beta$ be the Gibbs
distribution from which we wish to sample. Fix $0 = \beta_0 < \cdots < \beta_M = \beta$, a sequence
of inverse temperatures. The state space is the product space $\Omega_{sw} = \Omega^{M+1}$, the
product of $M + 1$ copies of the original state space, where each coordinate corresponds to an
inverse temperature. A configuration in the swapping chain is denoted by an $(M+1)$-tuple
$x = (x_0, \ldots, x_M)$. As before, define $\pi_i := \pi_{\beta_i}$ for $0 \le i \le M$ and let
$M_i$ be a Metropolis-Hastings chain for sampling from $\pi_i$ where at each time the proposal
chain chooses a vertex and spin independently and uniformly at random. Define $\pi_{sw}$ to be
the product measure of the distributions $\pi_i$:

$$\pi_{sw}(x) := \prod_{i=0}^{M} \pi_i(x_i).$$

In each step, the swapping Markov chain chooses an inverse temperature $\beta_i$ uniformly at
random and chooses uniformly from the following two types of transitions.


• A level move connects $x = (x_0, \ldots, x_i, \ldots, x_M)$ and
$x' = (x_0, \ldots, x_i', \ldots, x_M)$ if $x$ and $x'$ agree in all but the $i$-th component, and
$x_i$ and $x_i'$ are connected by one-step transitions of the Metropolis algorithm on $\Omega$.
In this case, the transition from $x$ to $x'$ has transition probability

$$P_{sw}(x, x') = \frac{1}{2(M+1)}\, M_i(x_i, x_i').$$

• A swap move connects $x = (x_0, \ldots, x_i, x_{i+1}, \ldots, x_M)$ to
$x' = (x_0, \ldots, x_{i+1}, x_i, \ldots, x_M)$, i.e., it exchanges the $i$-th and $(i+1)$-st
components with an appropriately chosen Metropolis probability.












From the current state $x$, choose a coordinate $i$ uniformly at random from
$\{0, \ldots, M-1\}$. Let $x'$ be the configuration obtained by exchanging the $i$-th and
$(i+1)$-st components of $x$. Then, the probability of the transition from $x$ to $x'$ is given by

$$P_{sw}(x, x') = \frac{1}{2M} \min\left(1, \frac{\pi_{sw}(x')}{\pi_{sw}(x)}\right) = \frac{1}{2M} \min\left(1, \frac{\pi_{i+1}(x_i)\, \pi_i(x_{i+1})}{\pi_i(x_i)\, \pi_{i+1}(x_{i+1})}\right).$$

It can be verified that the chain is reversible with respect to $\pi_{sw}$ and thus has
stationary distribution $\pi_{sw}$. Since $\pi_{sw}$ is a product measure, samples according to
$\pi_M$ can be obtained by projecting onto the last coordinate. Notice that in the transition
probabilities above, the normalizing constants cancel out. Hence, implementing a move of the
swapping chain is straightforward, unlike tempering, where good approximations for the partition
functions are required. Zheng proved that fast mixing of the swapping chain implies fast mixing
of the tempering chain [39]. The converse result is not known.
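The cancellation of the normalizing constants in a swap move can be made explicit for Gibbs weights. The sketch below assumes distributions of the form exp(beta * H(x)) / Z(beta); the function names and the numerical values of the partition functions are made up purely to show that they do not affect the result.

```python
import math

def swap_acceptance(beta_i, beta_j, H_xi, H_xj):
    """Acceptance probability for exchanging x_i (held at beta_i) with x_j (held at beta_j).

    The Metropolis ratio pi_j(x_i) pi_i(x_j) / (pi_i(x_i) pi_j(x_j)) reduces to
    exp((beta_j - beta_i) * (H(x_i) - H(x_j))), with no partition functions left.
    """
    log_ratio = (beta_j - beta_i) * (H_xi - H_xj)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

def swap_acceptance_explicit(beta_i, beta_j, H_xi, H_xj, Z_i, Z_j):
    """Same quantity from normalized probabilities; used here only as a cross-check."""
    def prob(b, H, Z):
        return math.exp(b * H) / Z
    ratio = (prob(beta_j, H_xi, Z_j) * prob(beta_i, H_xj, Z_i)) / (
        prob(beta_i, H_xi, Z_i) * prob(beta_j, H_xj, Z_j))
    return min(1.0, ratio)

# Arbitrary illustrative values; Z_i and Z_j are placeholders, not real partition functions.
a = swap_acceptance(0.5, 1.0, H_xi=2.0, H_xj=5.0)
b = swap_acceptance_explicit(0.5, 1.0, 2.0, 5.0, Z_i=7.0, Z_j=11.0)
```

Only the two Hamiltonian values and the two inverse temperatures are needed, which is why the swap move is implementable even when the partition functions are unknown.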

To define the tempering and swapping Markov chains in the case of the mean-field models
that we will study, the base proposal chain for the Metropolis chain at a fixed temperature will
be the heat bath Glauber dynamics. That is, at each time step, a uniformly random vertex and
a uniformly random spin is chosen, and the spin of the chosen vertex is updated to the chosen
spin.

For both tempering and swapping, we must be careful about how we choose the number of
distributions $M + 1$ and the distributions themselves. It is important that successive
distributions $\pi_i$ and $\pi_{i+1}$ have sufficiently small variation distance so that
temperature moves are accepted with non-trivial probability. At the same time, $M$ must be small
enough so that the running time of the algorithm does not become very large. Setting $M$ to be a
polynomial which is $\Omega(n)$ and setting $\beta_i = i\beta/M$ is often a reasonable choice.

In general, one can define a sequence of distributions $\pi_0, \ldots, \pi_M$ and define the
simulated tempering and swapping chains with these as the fixed temperature distributions by
defining the transition probabilities using the Metropolis rule as before. In the sequel we will
make use of this to define tempering and swapping chains for the mean-field Ising model with an
external field.


4. Results for Mean-Field Models 

Although the simulated tempering and swapping algorithms have been defined above as having
fixed temperature distributions which are Gibbs distributions, in fact the algorithms are more
general. Our first result makes use of this and bounds the mixing time of the swapping chain
which uses a different set of fixed temperature distributions (defined in Section 6).

Theorem 1. The swapping Markov chain for the ferromagnetic mean-field Ising model using
entropy dampening distributions mixes polynomially for every inverse temperature $\beta > 0$ and
any external field.

Unlike the Ising model, simulated tempering for the 3-state Potts model mixes slowly.

Theorem 2. Let $\beta_c$ denote the critical inverse temperature of the model. There is a constant
$c_1 > 0$ such that for any set of inverse temperatures
$\beta_c = \beta_M > \cdots > \beta_0 \ge 0$ such that $M = n^{O(1)}$, the tempering and swapping
chains with the distributions $\pi_{\beta_i}$ for the 3-state mean-field ferromagnetic Potts model
have mixing time $\tau(\varepsilon) \ge e^{c_1 n} \ln(1/\varepsilon)$.

The slow convergence of the tempering chain is caused by a first order phase transition in
the 3-state ferromagnetic Potts model. First order phase transitions are characterized by
phase-coexistence of ordered and disordered phases at a critical temperature [?]. In contrast, the
Ising model has a second-order (continuous) phase transition with no phase coexistence, which
explains why simulated tempering works for one model and not the other. Our
techniques using entropy dampening distributions do not seem to extend immediately to show
polynomial mixing of the swapping algorithm for the Potts model.

Let $\Omega_{RGB}$ denote the subset of the state space $\Omega$ of the 3-state Potts model where
$\sigma_1 \ge \sigma_2 \ge \sigma_3$. On the restricted space $\Omega_{RGB}$, we show that
tempering can slow down the Metropolis algorithm at a fixed temperature by an exponential
multiplicative factor.

Theorem 3. There are constants $c, c'$ with $0 < c' < c$ and an inverse temperature $\beta$ such
that the Metropolis chain on $\Omega_{RGB}$ at $\beta$ has mixing time
$\tau(\varepsilon) \le e^{c' n} \ln(1/\varepsilon)$ while the mixing time of the tempering chain is
bounded by $\tau(\varepsilon) \ge e^{c n} \ln(1/\varepsilon)$.

Though the mixing time of the Metropolis chain is exponential, to obtain this upper bound, 
it is not sufficient to bound the conductance, since such a bound is tight only up to quadratic 
factors. Instead, we will appeal to a refinement of the comparison theorem of Diaconis and 
Saloff-Coste. 


5. Swapping for the Asymmetric Exponential Distribution 

In this section we show bounds on the mixing time of the swapping Markov chain on the
asymmetric exponential distributions, generalizing the symmetric exponential distribution for
which Madras and Zheng showed swapping mixes in polynomial time [27]. This example will also
serve as a warm-up for the analysis of the next section for the mean-field Ising model. While we
focus here on the swapping algorithm, which is easily implementable, the distributions we define
can also be used for tempering. Indeed, Zheng has shown that polynomial mixing of swapping
with any distributions implies polynomial mixing of tempering with the same distributions [39].


5.1. Preliminaries. The proof makes use of a Markov chain decomposition theorem [26, 29].
Let $\mathcal{M}$ be a Markov chain with transition matrix $P$ and stationary distribution $\pi$.
Let $\Omega_1, \ldots, \Omega_m$ be a disjoint partition of the state space $\Omega$. For each
$i \in [m]$, define the Markov chain $\mathcal{M}_i$ on $\Omega_i$ whose transition matrix $P_i$,
called the restriction of $P$ to $\Omega_i$, is defined as

• $P_i(x,y) = P(x,y)$, if $x \ne y$ and $x, y \in \Omega_i$;

• $P_i(x,x) = 1 - \sum_{y \in \Omega_i,\, y \ne x} P_i(x,y)$, $\forall x \in \Omega_i$.

The stationary distribution of $\mathcal{M}_i$ is given by

$$\pi_i(x) = \frac{\pi(x)}{\pi(\Omega_i)}, \quad x \in \Omega_i.$$

Define the projection $\bar{P}$ to be the transition matrix on the state space $[m]$ given by

$$\bar{P}(i,j) = \frac{1}{\pi(\Omega_i)} \sum_{x \in \Omega_i,\, y \in \Omega_j} \pi(x) P(x,y).$$

The decomposition theorem bounds the spectral gap of the chain $\mathcal{M}$ by the spectral gap
of the projection chain and the gap of the slowest restriction chain.

Theorem 5.1 (Martin and Randall [29]).

$$Gap(P) \ge \frac{1}{2}\, Gap(\bar{P}) \left( \min_{i \in [m]} Gap(P_i) \right).$$
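The restriction and projection chains used by the decomposition theorem can be built mechanically from a transition matrix and a partition. The following is a small sketch under the assumption that the chain is stored as nested lists and the partition as lists of state indices; the function names `restriction` and `projection` and the 4-state example are illustrative choices, not taken from the paper.

```python
def restriction(P, block):
    """Restriction of P to the states in `block`; moves leaving the block become self-loops."""
    R = [[P[x][y] if x != y else 0.0 for y in block] for x in block]
    for k in range(len(block)):
        R[k][k] = 1.0 - sum(R[k])
    return R

def projection(P, pi, blocks):
    """Projection chain: Pbar(i, j) = (1 / pi(block_i)) * sum_{x in i, y in j} pi(x) P(x, y)."""
    m = len(blocks)
    mass = [sum(pi[x] for x in b) for b in blocks]
    return [[sum(pi[x] * P[x][y] for x in blocks[i] for y in blocks[j]) / mass[i]
             for j in range(m)] for i in range(m)]

# A reversible 4-state chain (a Metropolis walk on a path) with a low-probability middle,
# split into a left block and a right block.
pi = [0.4, 0.1, 0.1, 0.4]
P = [[0.875, 0.125, 0.0, 0.0],
     [0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5],
     [0.0, 0.0, 0.125, 0.875]]
blocks = [[0, 1], [2, 3]]
P_bar = projection(P, pi, blocks)
P_left = restriction(P, blocks[0])
```

In this example the projection is a two-state chain that crosses between the blocks with probability 0.1, so the bottleneck of the original chain shows up in the projection rather than in the restrictions.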


We will use a comparison theorem of Markov chains to bound the mixing time of the
projection chain defined below. The following comparison theorem of Diaconis and Saloff-Coste
can be used to bound the mixing time of a Markov chain when the mixing time of a related
chain on the same space, but with possibly a different stationary distribution, is known. Let
$\mathcal{M}_1$ and $\mathcal{M}_2$ be two Markov chains on $\Omega$. Let $P_1$ and $\pi_1$ be the
transition matrix and stationary distribution of $\mathcal{M}_1$ and let $P_2$ and $\pi_2$ be those
of $\mathcal{M}_2$. Let $E(P_1) = \{(x,y) : P_1(x,y) > 0\}$ and
$E(P_2) = \{(x,y) : P_2(x,y) > 0\}$ be sets of directed edges. For $x, y \in \Omega$ such that
$P_2(x,y) > 0$, define a path $\gamma_{xy}$, a sequence of states $x = x_0, \ldots, x_k = y$ such
that $P_1(x_i, x_{i+1}) > 0$. Finally, let
$\Gamma(z,w) = \{(x,y) \in E(P_2) : (z,w) \in \gamma_{xy}\}$ denote the set of endpoints of paths
that use the edge $(z,w)$.

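Theorem 5.2 (Diaconis and Saloff-Coste [?]). Let
$a = \min_x \left( \dfrac{\pi_2(x)}{\pi_1(x)} \right)$. Then

$$Gap(P_1) \ge \frac{a}{A}\, Gap(P_2),$$

where

$$A = \max_{(z,w) \in E(P_1)} \left\{ \frac{1}{\pi_1(z) P_1(z,w)} \sum_{(x,y) \in \Gamma(z,w)} |\gamma_{xy}|\, \pi_2(x) P_2(x,y) \right\}.$$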


Note that in the case that the stationary distributions of the two chains are the same, the above 
reduces to the more commonly used version of the comparison theorem where a = 1. 


5.2. The Bimodal Exponential Distribution. Let $C > 1$ be a real constant. Let $N$ and $N'$ be
positive integers. Define the bimodal exponential distribution over the integers in $[-N, N']$ as

$$\pi(x) := \frac{C^{|x|}}{Z}, \quad x \in \{-N, \ldots, N'\},$$

where $Z$ is the normalizing constant. Define the distributions for the swapping chain $P_{sw}$ as

$$\pi_i(x) := \frac{C^{\frac{i}{M}|x|}}{Z_i}, \quad 0 \le i \le M, \quad x \in \{-N, \ldots, N'\},$$

where $Z_i$ is a normalizing constant and $M$ is the number of distributions, which we will assume
is a polynomial in $N + N'$. Let $P_i$ be the Metropolis-Hastings chain for sampling from $\pi_i$
where the base proposal chain is the simple symmetric random walk on $\{-N, \ldots, N'\}$. That is,
the proposal kernel is

$$K(x,y) = \begin{cases} \frac{1}{2} & \text{if } |x - y| = 1 \text{ or } x = y = -N \text{ or } N', \\ 0 & \text{otherwise.} \end{cases}$$

The state space for the swapping chain is $\Omega_{sw} = \{-N, \ldots, N'\}^{M+1}$ and its
stationary distribution is the product measure $\pi_{sw}$ of the distributions $\pi_i$.

Theorem 5.3. The swapping chain $P_{sw}$ with distributions $\pi_0, \ldots, \pi_M$ is polynomially
mixing.
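The tempered family for the bimodal exponential distribution is easy to tabulate. The sketch below assumes the form C^(i|x|/M) on the integers from -N to N' and uses hypothetical function names and small illustrative parameters; it is meant only to show how the family interpolates between the uniform distribution and the target.

```python
def tempered_distribution(i, M, C, N, N_prime):
    """pi_i(x) proportional to C ** (i * |x| / M) on the integers -N..N_prime."""
    support = list(range(-N, N_prime + 1))
    weights = [C ** (i * abs(x) / M) for x in support]
    Z_i = sum(weights)  # normalizing constant of the i-th distribution
    return {x: w / Z_i for x, w in zip(support, weights)}

def metropolis_step_prob(pi_i, x, y, N, N_prime):
    """One-step Metropolis probability of moving from x to a neighbour y = x +/- 1."""
    if abs(x - y) != 1 or not (-N <= y <= N_prime):
        return 0.0
    # The simple random walk proposes each neighbour with probability 1/2.
    return 0.5 * min(1.0, pi_i[y] / pi_i[x])

C, N, N_prime, M = 2.0, 3, 5, 4
pi_0 = tempered_distribution(0, M, C, N, N_prime)  # uniform
pi_M = tempered_distribution(M, M, C, N, N_prime)  # the target C**|x| / Z
```

With N different from N' the two modes at the endpoints carry exponentially different mass at the target temperature, which is the asymmetry the analysis has to handle.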


Since our goal in this work is not to optimize the running times but rather to distinguish 
between models where the mixing of tempering and swapping are polynomial vs. exponential, 
we do not state the precise polynomial upper bounds on the mixing time in the theorem. 

Definition 5.4. Let $x = (x_0, \ldots, x_M) \in \Omega_{sw}$. The trace
$Tr(x) = t := (t_0, \ldots, t_M) \in \{0,1\}^{M+1}$ where $t_i = 0$ if $x_i < 0$ and $t_i = 1$ if
$x_i \ge 0$, $i = 0, \ldots, M$.
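The trace and the partition it induces can be enumerated for a toy instance. This is a small sketch with hypothetical function names, assuming configurations are tuples of integers in the range -N to N'.

```python
import itertools

def trace(x):
    """Trace of a swapping-chain configuration: t_i = 0 if x_i < 0, otherwise 1."""
    return tuple(0 if xi < 0 else 1 for xi in x)

def partition_by_trace(N, N_prime, M):
    """Group all configurations in {-N..N_prime}^(M+1) by their trace."""
    support = range(-N, N_prime + 1)
    classes = {}
    for x in itertools.product(support, repeat=M + 1):
        classes.setdefault(trace(x), []).append(x)
    return classes

# Toy instance: two temperatures (M = 1) on the integers {-1, 0, 1}.
classes = partition_by_trace(N=1, N_prime=1, M=1)
```

Each class is a product of half-lines, one per temperature, which is why the restricted chains without swap moves factor into independent unimodal chains.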

The $2^{M+1}$ possible values of the trace characterize the partition we use to apply the
decomposition theorem. Letting $\Omega_t$ be the set of configurations with trace $t$, we have the
decomposition

$$\Omega_{sw} = \bigcup_{t \in \{0,1\}^{M+1}} \Omega_t.$$

Let $P_t$ be the restriction of $P_{sw}$ to the configurations of fixed trace $t$. The state space
of the projection $\bar{P}$ is the $(M+1)$-dimensional hypercube, representing the set of possible
traces $t$. In [27], the spectral gap for a different version of the swapping chain for the
symmetric exponential distribution was analyzed by decomposition. Our analysis of the restriction
chains is similar to the analysis in [27] for the correspondingly defined restriction chains.
Analyzing the projection, however, becomes more difficult, since in this case the stationary
distribution over the hypercube is highly non-uniform. This reflects the fact that at “low
temperatures,” one side of the bimodal distribution becomes exponentially more favorable. We will
resolve this by an application of the comparison theorem with an auxiliary chain.


Mixing time of the restricted chains: 


Note that if we ignore swap moves in the restriction chains $P_t$, the moves at each of the
$M + 1$ temperatures are independent and according to the Metropolis probabilities. Let
$\tilde{P}_t$ be a modified chain which suppresses swap moves in $P_t$ while the trace is fixed at
$t$. The following lemma allows us to express the spectral gap of $\tilde{P}_t$ in terms of the
spectral gaps of the independent chains at each fixed temperature.

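Lemma 5.5. (Diaconis and Saloff-Coste [?]) For $i = 0, \ldots, M$, let $P_i$ be a reversible Markov
chain on a finite state space $\Omega_i$. Consider the product Markov chain $P$ on the product
space $\Omega_0 \times \cdots \times \Omega_M$, defined by

$$P = \frac{1}{M+1} \sum_{i=0}^{M} \underbrace{I \otimes \cdots \otimes I}_{i} \otimes P_i \otimes \underbrace{I \otimes \cdots \otimes I}_{M-i}.$$

Then $Gap(P) = \frac{1}{M+1} \min_{0 \le i \le M} \{Gap(P_i)\}$.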
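The product-chain identity can be checked by hand on two-state chains, where the second eigenvalue of [[1-a, a], [b, 1-b]] is 1 - a - b. The sketch below builds the product chain with Kronecker products in plain Python and verifies one eigenvector; the helper names are hypothetical, and the example uses lazy chains so that all eigenvalues are non-negative and the absolute value in the definition of the gap plays no role.

```python
def kron(A, B):
    """Kronecker product of two square matrices given as lists of lists."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def product_chain(chains):
    """Average over i of I x ... x P_i x ... x I: update one uniformly chosen coordinate."""
    k = len(chains)
    total = None
    for i, P_i in enumerate(chains):
        term = [[1.0]]
        for j, P_j in enumerate(chains):
            term = kron(term, P_i if j == i else identity(len(P_j)))
        total = term if total is None else [
            [a + b for a, b in zip(ra, rb)] for ra, rb in zip(total, term)]
    return [[v / k for v in row] for row in total]

P0 = [[0.9, 0.1], [0.1, 0.9]]      # second eigenvalue 0.8, gap 0.2
P1 = [[0.75, 0.25], [0.25, 0.75]]  # second eigenvalue 0.5, gap 0.5
P = product_chain([P0, P1])

# (1, -1) is the eigenvector of P0 for 0.8; tensoring it with (1, 1) gives an
# eigenvector of P with eigenvalue (0.8 + 1) / 2 = 0.9 = 1 - (1/2) * min(0.2, 0.5).
v = [1.0, 1.0, -1.0, -1.0]
Pv = [sum(p * vj for p, vj in zip(row, v)) for row in P]
```

The remaining nontrivial eigenvalues of this product chain are 0.75 and 0.65, so 0.9 is indeed the second largest and the gap is 0.1.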

The distribution over $\Omega_t$ restricted to each of the $M + 1$ temperatures is unimodal,
suggesting that $\tilde{P}_t$ should be polynomially mixing at each temperature. Madras and Zheng
formalize this in [27] and show that the Metropolis chain restricted to the positive or negative
parts of $\Omega_i$ mixes polynomially. By Lemma 5.5 and following the arguments as in [27], it can
be shown that the Markov chains $\tilde{P}_t$ are polynomially mixing for each
$t \in \{0,1\}^{M+1}$. We omit these calculations here since they are exactly along the lines of
those in [27]. Next we show $P_t$ mixes in polynomial time by comparing it with $\tilde{P}_t$.

Lemma 5.6. For each trace $t \in \{0,1\}^{M+1}$, the restriction chain $P_t$ mixes in polynomial
time.


Proof. We make use of the comparison theorem Theorem 15.21 To apply the theorem, for each 
transition of Pf, we construct a canonical path consisting of moves in Pf. 

Let {x,x') be a transition of Pt with a = (aq, ... ,xm) and x' = (aq, ... ,a^). If {x,x') is a level 
move which updates the state at a fixed temperature, let the corresponding canonical path in Pt 
be the edge (x,x') itself. On the other hand, suppose {x,x') is a swap move which exchanges 
the ith and / + 1st components. Note that since the trace remains fixed, it must be the case that 
ti = ti+\. Suppose that a/ > Xi+\ > 0. In this case define the canonical path as the concatenation 
of two paths p\op 2 , each consisting of a sequence of level moves at a fixed temperature. 

• The path $p_1$ consists of $x_i - x_{i+1}$ level moves at the temperature $i+1$ from $x$ to $(x_0, \dots, x_{i-1}, x_i, x_i, x_{i+2}, \dots, x_M)$.

• The path $p_2$ consists of $x_i - x_{i+1}$ level moves at the temperature $i$ from $(x_0, \dots, x_{i-1}, x_i, x_i, x_{i+2}, \dots, x_M)$ to $x'$.

If, on the other hand, $x_{i+1} > x_i > 0$, the canonical path is defined as the concatenation $p_1 \circ p_2$ where

• The path $p_1$ consists of $x_{i+1} - x_i$ level moves at the temperature $i$ from $x$ to $(x_0, \dots, x_{i-1}, x_{i+1}, x_{i+1}, x_{i+2}, \dots, x_M)$.

• The path $p_2$ consists of $x_{i+1} - x_i$ level moves at the temperature $i+1$ from $(x_0, \dots, x_{i-1}, x_{i+1}, x_{i+1}, x_{i+2}, \dots, x_M)$ to $x'$.

The idea behind the definition of the canonical paths is to use the higher probability state (either $x_i$ or $x_{i+1}$) to ensure the edges along the path will always have sufficiently high weight.
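The construction of the two-phase canonical path can be sketched in code. This is an illustration only, under the assumption that states at each temperature are integers and that a level move changes one coordinate by exactly one; the function name `swap_canonical_path` is hypothetical.

```python
def swap_canonical_path(x, i):
    """States visited when the swap of coordinates i and i+1 of x is replaced
    by level moves: first raise the smaller coordinate (path p1), then lower
    the other coordinate (path p2)."""
    a, b = x[i], x[i + 1]
    if a == b:
        return [tuple(x)]                 # the swap is trivial
    # p1 acts on the coordinate holding the smaller value, p2 on the other one.
    raise_pos, lower_pos = (i + 1, i) if a > b else (i, i + 1)
    steps = abs(a - b)
    state = list(x)
    path = [tuple(state)]
    for _ in range(steps):                # path p1
        state[raise_pos] += 1
        path.append(tuple(state))
    for _ in range(steps):                # path p2
        state[lower_pos] -= 1
        path.append(tuple(state))
    return path


path = swap_canonical_path((5, 4, 1, 7), 1)
print(path)
```

Every intermediate state has both affected coordinates at least as large as the smaller of the two original values, and the midpoint of the path has both coordinates equal to the larger one, which is the point the text makes about keeping edge weights high.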

We can bound the factor $A$ in Theorem 5.2 as follows. First note that the paths are at most polynomial in length, and the number of transitions of $P_t$ whose canonical paths use any given transition of the swap-suppressed chain $\tilde P_t$ is at most polynomial, since there are only polynomially many possible values of $x_i$ and $x_{i+1}$. Hence, it is enough to show that for any transition $(z,w)$ of $\tilde P_t$ and any transition $(x,x')$ of $P_t$ such that $(x,x') \in \Gamma(z,w)$, the quantity

$$\frac{\pi_t(x) P_t(x,x')}{\pi_t(z) \tilde P_t(z,w)}$$

is at most a polynomial, where $\pi_t$ denotes the stationary measure for $P_t$ as well as $\tilde P_t$. This can be verified in a straightforward way by checking the possible cases. For example, assume that the transition $(z,w)$ changes the $(i+1)$st coordinate of $z$. Then either it is on a path $p_1$ with $x_i > x_{i+1}$, or it is on a path $p_2$ with $x_{i+1} > x_i$. Consider the first case, $x_i > x_{i+1}$. In this case, $z_i = w_i = x_i$ and we obtain


■K,{x)P,{x,x') 

%t{z)Pt{z,w) 


ni{xi)'iii+i{xi+i)mm 


f I 7I,+i(j:,)7I,(x,-+i) \ 


ni{xi)'iii+i{zi+i)mm 


(1 Ji:,-n(w,-+i) \ 

V ’ ic,+i(z,-+i) ) 


min(C('+')^''+i/“,C('+i)“'''+i/“) 


The last expression can be simplified using the fact that by construction of the path, $z_{i+1} < w_{i+1}$. Finally, by the construction of the path and the fact that the level moves preserve the trace, we also know that $z_{i+1} \ge x_{i+1}$. Hence, we obtain

%t{x)Pt{x,x') „)/M < ^ 

%t{z)Pt{z,w) 
A similar calculation made for the case that $(z,w)$ is an edge on the path $p_2$, so that $x_{i+1} > x_i$, $x_i \le w_{i+1} < z_{i+1}$ and $w_i = z_i = x_{i+1}$, shows that

Tl,{x)Ptix,x') ^ ^ 

Tl, {z)Pt{z,w) 

Combining the bound on $A$ with the polynomial mixing of the swap-suppressed chain and applying Theorem 5.2, we conclude that the restriction chains $P_t$ mix in polynomial time. □

Mixing time of the projection:

The stationary probabilities of the projection chain are given by

$$\bar\pi(t) = \sum_{x : \mathrm{Tr}(x) = t} \pi_{sw}(x).$$

It can be verified that $\bar\pi$ is also a product measure over the temperatures. Let $\bar\pi_i$ denote the two-point distribution at each temperature, so that $\bar\pi$ is the product of the $\{\bar\pi_i\}$.

To show the projection P mixes in polynomial time, we will compare it to the following simpler chain P on the $(M+1)$-dimensional hypercube with the same stationary distribution. In P, at each step we are allowed to transpose two neighboring bits, or we can flip just the first bit.
Each of these moves is performed with the appropriate Metropolis probability. This captures the 
idea that for the true projection chain P, swap moves (corresponding to transpositions of bits) 
always have constant probability, and that at the highest temperature there is high probability 
of changing sign. Of course there is in addition the chance of flipping the bit at each lower 
temperature, but this seems to be a smaller effect. 

More formally, at each step in P, we pick $i \in \{0, \dots, M\}$ uniformly at random and update the component $t_i$ by choosing $t_i' \in \{0,1\}$ with probability $\bar\pi_i(t_i')$, i.e., exactly according to the appropriate stationary distribution. In other words, the component is at stationarity as soon as it is chosen. Using the coupon collector's theorem, we have

Lemma 5.7. The chain P on $\{0,1\}^{M+1}$ mixes in time $O(M \log(M + \varepsilon^{-1}))$ and $\mathrm{Gap}(P)^{-1} = O(M \log M)$.
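The coupon collector argument behind Lemma 5.7 can be made concrete: once every coordinate has been chosen at least once, the whole vector is distributed according to the product measure. The sketch below computes the expected time until all $M+1$ coordinates have been refreshed, which is $(M+1)H_{M+1} = O(M \log M)$, and compares it with a simulation. The function names are illustrative.

```python
import random


def expected_refresh_time(M):
    """Expected number of uniform draws from {0, ..., M} until every index
    has been drawn at least once: (M + 1) * H_{M+1}."""
    m = M + 1
    return m * sum(1.0 / k for k in range(1, m + 1))


def simulate_refresh_time(M, trials=20000, seed=0):
    """Monte Carlo estimate of the same expectation."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen = set()
        steps = 0
        while len(seen) < M + 1:
            seen.add(rng.randrange(M + 1))
            steps += 1
        total += steps
    return total / trials


print(expected_refresh_time(3), simulate_refresh_time(3))
```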

We are now in a position to prove the following lemma.

Lemma 5.8. The projection P of the swapping Markov chain is polynomially mixing on $\{0,1\}^{M+1}$.

Proof. To apply the comparison theorem, we translate transitions of the chain of Lemma 5.7, whose mixing time we know, into a canonical path consisting of swap moves and flips of the first bit. Let $(t,t')$ be a single transition of the chain of Lemma 5.7 from $t = (t_0, \dots, t_i, \dots, t_M)$ to $t' = (t_0, \dots, 1 - t_i, \dots, t_M)$ that flips the $i$th bit. The canonical path from $t$ to $t'$ is the concatenation of three paths $p_1 \circ p_2 \circ p_3$. In terms of tempering, $p_1$ is a heating phase and $p_3$ is a cooling phase.

• The path $p_1$ consists of $i$ swap moves from $t$ to $(t_i, t_0, \dots, t_{i-1}, t_{i+1}, \dots, t_M)$;

• The path $p_2$ consists of one step that flips the bit corresponding to the highest temperature to move to $(1 - t_i, t_0, \dots, t_{i-1}, t_{i+1}, \dots, t_M)$;

• The path $p_3$ consists of $i$ swaps until we reach $t' = (t_0, \dots, 1 - t_i, \dots, t_M)$.
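The heating, flipping and cooling phases of this path can be sketched as follows. This is a minimal illustration on bit vectors with index 0 as the highest temperature; the function name is hypothetical. Swapping two equal neighboring bits leaves the state unchanged, so the list of visited states may contain repeats.

```python
def flip_canonical_path(t, i):
    """States visited when flipping bit i of t is simulated by adjacent
    transpositions and a single flip of the bit at index 0."""
    state = list(t)
    path = [tuple(state)]
    # p1 (heating): carry bit i to position 0 with i adjacent swaps.
    for j in range(i, 0, -1):
        state[j - 1], state[j] = state[j], state[j - 1]
        path.append(tuple(state))
    # p2: flip the bit at the highest temperature.
    state[0] = 1 - state[0]
    path.append(tuple(state))
    # p3 (cooling): carry the flipped bit back to position i.
    for j in range(i):
        state[j], state[j + 1] = state[j + 1], state[j]
        path.append(tuple(state))
    return path


flip_path = flip_canonical_path((0, 1, 1, 0), 2)
print(flip_path)
```

The path uses $2i + 1$ moves, so its length is at most $2M + 1$, which is the polynomial length bound needed in the comparison argument.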
To bound $A$ in Theorem 5.2 we will establish that

ft(z) p{z,z!) > ]^{t)P{t,t'), 


( 1 ) 
for any transition $(z,z')$ in the canonical path. Second, we need to ensure that the number of paths $\Gamma(z,z')$ using the transition $(z,z')$ is at most a polynomial. These two conditions are sufficient to give a polynomial bound on the parameter $A$ in the comparison theorem. For any $(z,z')$ the size $|\Gamma(z,z')|$ is polynomially bounded, so it remains to establish the condition in Equation (1).


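Case 1: Transitions along $p_1$.

Let $z = (t_0, \dots, t_{j-1}, t_i, t_j, \dots, t_{i-1}, t_{i+1}, \dots, t_M)$ and $z' = (t_0, \dots, t_i, t_{j-1}, \dots, t_{i-1}, t_{i+1}, \dots, t_M)$. Then

$$\bar\pi(z) P(z,z') = \frac{\bar\pi(z)}{2(M+1)} \min\left(1, \frac{\bar\pi(z')}{\bar\pi(z)}\right) = \frac{\min(\bar\pi(z), \bar\pi(z'))}{2(M+1)}. \qquad (2)$$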

First we consider $\bar\pi(z)$.

$$\bar\pi(z) = \prod_{\ell=0}^{M} \sum_{x : \mathrm{Tr}(x)_\ell = z_\ell} \pi_\ell(x) = \prod_{\ell=0}^{M} \bar\pi_\ell(z_\ell).$$

Assume, without loss of generality, that N <N'. Then we have 

< mni(^Kh-),7t,-(l-t,■))-!-. 

M+\ A A 

nn 

M+l’ 

where $t^* = (t_0, \dots, t_{i-1}, 0, t_{i+1}, \dots, t_M)$. We want to show that $\bar\pi(t^*) \le \bar\pi(z)$. It is useful to partition $t^*$ into blocks of bits $t_\ell$ that equal 1, separated by one or more zeros. Let $k < i$ be the largest value such that $t_k = 0$. It can be verified from the definition of the distribution that

$$\bar\pi_i(1)\,\bar\pi_{i+1}(0) \ge \bar\pi_i(0)\,\bar\pi_{i+1}(1).$$

From this fact, it follows that

$$\prod_{\ell=k+1}^{i} \bar\pi_\ell(z_\ell) \ge \prod_{\ell=k+1}^{i} \bar\pi_\ell(t^*_\ell).$$

Similarly, considering the next block of $t^*$ (i.e., the next set of bits such that $t_\ell = 1$) until the first index $k'$ such that $t_{k'} = 0$,

$$\prod_{\ell=k'+1}^{k} \bar\pi_\ell(z_\ell) \ge \prod_{\ell=k'+1}^{k} \bar\pi_\ell(t^*_\ell).$$

Continuing in this way we find

$$\prod_{\ell=j}^{i} \bar\pi_\ell(z_\ell) \ge \prod_{\ell=j}^{i} \bar\pi_\ell(t^*_\ell),$$

and thus

$$\bar\pi(z) \ge \bar\pi(t^*).$$
Likewise, by taking one more term, we find that $\bar\pi(z') \ge \bar\pi(t^*)$. Together with equation (2) this implies


n(z) P(z,z) > ^7i(t) 

Case 2: The transition along $p_2$. Consider the transition from $z = (t_i, t_0, \dots, t_{i-1}, t_{i+1}, \dots, t_M)$ to $z' = (1 - t_i, t_0, \dots, t_{i-1}, t_{i+1}, \dots, t_M)$ that flips the first bit of $z$. Repeating the argument from Case 1, it follows that

min (7t(z),7t(z')) 

Therefore, again we find equation (1) is satisfied.

Case 3: Transitions along $p_3$. This is similar to Case 1.

In all three cases, we find that if $(z,z')$ is one step on the canonical path from $t$ to $t'$, equation (1) is satisfied. Therefore, it follows that


A = max _ 

(z,z')eE(P) 


r(z,z') 

n{z)P{z,z') 


= 0{M^). 


Hence, by Lemma 15^ applying Theorem l5.21 Gap{P) = Q.{M “^InM). 

□ 

This establishes all the results necessary to apply the decomposition theorem, Theorem 5.1, completing the proof of Theorem 5.3 by Theorem 2.3.


6. Mean-Field Models 


The analysis from the last section suggests how to design distributions for swapping and 
tempering in cases where the mixing time is not known or known to be exponentially large. We 
consider examples of mean-field models to illustrate the ideas. While the examples below are 
very specific to mean-field models, our results indicate that there are more robust methods for 
designing tempering and swapping algorithms. 

Mean-Field Ising Model with an External Field: An important special case of the $q$-state Potts model with an external field is the mean-field Ising model in the presence of an external field. This model is defined by parameters $q = 2$, $\beta > 0$, the inverse temperature, and $h$, the external magnetic field. The Gibbs distribution over configurations $x \in \Omega = \{+1, -1\}^n$ is


(a) = 7tp,/,W 


exp (P (L-<; iLl ) ) 


Z(P,/j) 


where $Z(\beta, h)$ is the normalizing constant. We will show that with a modified set of appropriately “dampened” distributions, swapping can be used to sample configurations in this case.

6.1. Entropy Dampening Distributions. Traditionally, a convenient choice for the swapping and tempering distributions is the set of tempered distributions given by the Gibbs distributions $\pi_i$ for a chosen sequence of inverse temperatures.

The idea for the new distributions we define stems from the observation that this may be 
a poor choice of interpolants because they preserve the first-order phase transition, as we will 
show in the next section is the case for the Potts model. We can do much better by exploring 
a wider class of interpolating distributions. To see the flexibility we have in defining the set of 
distributions, define 


$$\rho_i(x) = \frac{\pi_i(x) f_i(x)}{Z_i},$$

where $Z_i = \sum_{x \in \Omega} \pi_i(x) f_i(x)$ is another normalizing constant. When $f_i(x)$ is taken to be the constant function, we obtain the usual tempered distributions. Recall that in the mean-field model with $q$ spins, we let $\sigma = (\sigma_1, \dots, \sigma_q)$ denote the numbers of vertices with spins $1, \dots, q$. Define


fM) = , ^ ^ 

The Flat-swap algorithm or chain is then defined to be the swapping algorithm using the distributions $\rho_0, \dots, \rho_M$ as defined above. We define the Flat-tempering algorithm analogously using the same set of distributions.


6.2. Polynomial Mixing of Flat-Swap. For a configuration $x \in \Omega$, recall that we let $\sigma_i$ be the number of vertices colored $i$ and let $\sigma = \sigma(x) = (\sigma_1, \dots, \sigma_q)$, where $\sum_i \sigma_i = n$. Define $\Omega_\sigma \subseteq \Omega$ to be the set of configurations with $\sigma_i$ vertices assigned color $i$. The total spins distribution is the discrete distribution on the set of possible $\sigma$,

$$S_\sigma = \pi(\Omega_\sigma) = \sum_{x \in \Omega_\sigma} \pi(x).$$

For the Ising model, we set 



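if in the configuration $x$, $k$ vertices are assigned $+1$ and $n - k$ are assigned $-1$. Let $\beta_i = \beta \cdot \frac{i}{M}$. Note that $f_i(x)$ is easy to compute given $x$. A simple calculation shows that

$$\rho_i(\Omega_\sigma) = \frac{1}{Z_i} \left( \pi_M(\Omega_{(k, n-k)}) \right)^{i/M}. \qquad (3)$$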

The function /,(x) effectively dampens the entropy (multinomial) just as the change in temper¬ 
ature dampens the energy term coming from the Hamiltonian. Thus, all the total spins distribu¬ 
tions have the same relative shape, but get flatter as i is decreased. This no longer preserves the 
cut in the state space of the distributions for the usual swap algorithm. It is this property that 
makes this choice of distributions useful. 
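The effect of entropy dampening on the total spins distributions can be illustrated numerically. The sketch below rests on several assumptions that are not fixed by the text above: the exponent of the Gibbs weight is taken to be $\beta(\sum_{u<v} x_u x_v + h \sum_u x_u)$, the inverse temperatures are $\beta_i = \beta i / M$, and the dampening factor is $f_i(x) = \binom{n}{k}^{(i-M)/M}$, which is the choice consistent with equation (3). All function names and parameter values are illustrative.

```python
from math import lgamma


def log_binom(n, k):
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def energy(n, k, h):
    """Assumed exponent per unit beta for k plus-spins, written through the
    magnetization m = 2k - n: sum_{u<v} x_u x_v + h * sum_u x_u."""
    m = 2 * k - n
    return (m * m - n) / 2.0 + h * m


def tempered_log_weights(n, beta, h, i, M):
    """Unnormalized log total-spins weights of the usual tempered distribution."""
    beta_i = beta * i / M
    return [log_binom(n, k) + beta_i * energy(n, k, h) for k in range(n + 1)]


def dampened_log_weights(n, beta, h, i, M):
    """Same, including the assumed factor log f_i = ((i - M) / M) * log C(n, k)."""
    beta_i = beta * i / M
    return [log_binom(n, k) + (i - M) / M * log_binom(n, k)
            + beta_i * energy(n, k, h) for k in range(n + 1)]


def interior_minima(weights):
    """Indices k at which the weight sequence has a strict local minimum."""
    return [k for k in range(1, len(weights) - 1)
            if weights[k] < weights[k - 1] and weights[k] < weights[k + 1]]


n, M, h = 40, 5, 0.1
beta = 1.5 / n
for i in range(1, M + 1):
    print(i,
          interior_minima(tempered_log_weights(n, beta, h, i, M)),
          interior_minima(dampened_log_weights(n, beta, h, i, M)))
```

In this toy instance the tempered distributions lose their interior minimum at the higher temperatures, while the dampened distributions for $i \ge 1$ keep it at the same value of $k$, so the same trace partition separates the two modes at every level.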

Theorem 6.1. The Flat-swap algorithm mixes polynomially for every inverse temperature $\beta > 0$ and any external field $h$ for the Ising model.

We follow the strategy set forth in the proof of Theorem 5.3, using decomposition and comparison in a similar manner. The total spins distribution for the Ising model is known to be bimodal above the critical inverse temperature (at and below the critical inverse temperature there is a unique maximum), even in the presence of an external field. With our choice of distributions $\rho_i$, it now follows that all $M+1$ total spins distributions are bimodal as well. Moreover, the minima of the distributions occur at the same value of $k \in [n]$ for all $M+1$ distributions. Let $\ell_{\min}$ be the value of $k$ which is the minimum. Let $\sigma^0_{\max}$ and $\sigma^1_{\max}$ denote the equivalence classes of configurations which maximise the total spins distributions for $i = M$ on either side of $\ell_{\min}$.

Let $\Omega^{M+1}$ be the state space of the chain. Define the trace $\mathrm{Tr}(x) = t \in \{0,1\}^{M+1}$, where $t_i = 0$ if the number of $+1$s in $x_i$ is less than $\ell_{\min}$ and $t_i = 1$ if the number of $+1$s in $x_i$ is at least $\ell_{\min}$. As in the case of the exponential distribution, we partition the state space according to the trace of the state $x$. Let $\bar P$ be the projection Markov chain for this partition and let $P_t$ denote the corresponding restriction chains.

We begin by showing that the projection chain mixes in polynomial time. The idea of the proof that the restrictions mix polynomially is analogous to the arguments in Lemma 5.6, although the details are slightly different for the Ising model, and we make use of results of [27].

Let $\bar\rho$ be the stationary distribution, which is the product of the distributions $\bar\rho_i$, each of which is a distribution on the two-point space $\{0,1\}$. Without loss of generality let 0 be the coordinate for which the mode is lower at the temperature $M$.

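Lemma 6.2. At every temperature $i$, $\rho_i(\sigma^0_{\max}) \le \rho_i(\sigma^1_{\max})$.

Proof. By (3),

$$\rho_i(\sigma^0_{\max}) = \frac{1}{Z_i}\left(\rho_M(\sigma^0_{\max})\right)^{i/M} \le \frac{1}{Z_i}\left(\rho_M(\sigma^1_{\max})\right)^{i/M} = \rho_i(\sigma^1_{\max}).$$

□

Lemma 6.3. At every temperature $i$, for $z \in \{0,1\}$, $\bar\rho_i(z)$ is within a factor of $O(n)$ of $\rho_i(\sigma^z_{\max})$.

Proof. Clearly, $\bar\rho_i(z) \ge \rho_i(\sigma^z_{\max})$. On the other hand, there are $O(n)$ equivalence classes of configurations, of which $\sigma^z_{\max}$ is one and has the largest relative weight. Hence, $\bar\rho_i(z) \le O(n)\,\rho_i(\sigma^z_{\max})$.

□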


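Corollary 6.4. For every pair of temperatures $i > j$,

$$\bar\rho_i(0)\,\bar\rho_j(1) \le O(n^2)\,\bar\rho_i(1)\,\bar\rho_j(0).$$

Proof. By Lemma 6.3,

$$\frac{\bar\rho_i(0)}{\bar\rho_j(0)} \le O(n)\,\frac{\rho_i(\sigma^0_{\max})}{\rho_j(\sigma^0_{\max})} = O(n)\,\frac{Z_j}{Z_i}\left(\rho_M(\sigma^0_{\max})\right)^{(i-j)/M} \le O(n)\,\frac{Z_j}{Z_i}\left(\rho_M(\sigma^1_{\max})\right)^{(i-j)/M} \le O(n^2)\,\frac{\bar\rho_i(1)}{\bar\rho_j(1)}.$$

□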


Theorem 6.5. The projection Markov chain $\bar P$ is polynomially mixing on $\{0,1\}^{M+1}$.

Proof. We appeal to the comparison theorem. Let $\hat P$ be the Markov chain on $\{0,1\}^{M+1}$ whose transitions choose a random index $i$ and update $t_i$ to $t_i' \in \{0,1\}$ with probability proportional to $\bar\rho_i(t_i')$. It is clear that $\hat P$ mixes in polynomial time since whenever the temperature $i$ is chosen, the corresponding coordinate is at stationarity after the update. Since the temperatures are chosen uniformly at random among the $M+1$ possible temperatures, standard coupon-collector arguments imply that the mixing time of $\hat P$ is $O(M \log M)$.

Let $t = (t_0, \dots, t_i, \dots, t_M)$ and $t' = (t_0, \dots, t_i', \dots, t_M)$ be two states that differ by a single transition of the single-coordinate update chain. We define a path between them using transitions of the projection chain. Assume that $t_i = 0$. In the other case, define the path to be the reverse. Denote the path by $t = z^0, z^1, \dots, z^r = t'$. From $z^m$, define $z^{m+1}$ as follows:

• Let $j$ be the largest index such that $z^m_j = 0$ and $t'_j = 1$.

• If there is a largest index $k < j$ such that $z^m_k = 1$, then $z^{m+1}$ is obtained by swapping the bits in positions $k$ and $k+1$.

• If there is no such $k$, $z^{m+1}$ is obtained by flipping the bit $z^m_0$ from 0 to 1.

The idea of the path is to flip the bit at the index $j$ from 0 to 1 by moving a 1 up from the first available position, performing a series of swaps through a block of 0's. At the end of the series of swaps, the difference at $j$ has been removed, and the difference is now at the starting point of the swaps. As the bit 1 moves up, there can be at most 2 differences due to it: one at its current position and the other at the position where the series of swaps began. Hence there are at most 4 indices where there could be a difference between $t$ and $z^m$ (the other two being the index $j$ and possibly the index $i$). Moreover, by the construction, the indices must be such that for the highest, say $i_1$, $t_{i_1} = 0$ while $z^m_{i_1} = 1$, and the difference then alternates. Thus, we have that for any configuration $z^m$ along the path, by Corollary 6.4,


P(l) 

plz™) 


< 0{ri^ 


Suppose that we fix a transition $(z,z')$. The number of pairs $t,t'$ such that the path between them passes through $(z,z')$ can be bounded by a polynomial in $M$, since we must only specify the positions at which $z$ differs from $t$ and possibly the index $i$ to be able to reconstruct both $t$ and $t'$. We can now bound the factor $A$ in the comparison theorem, Theorem 5.2, as follows


r(z,z') 

p(z)p(z,z') 


< max_ 

(z,z')€£(P) 

< 0{n^MA- 

The claim follows by applying Theorem 5.2 and using the fact that the single-coordinate update chain mixes in polynomial time. □

Recall that $P_t$ is the restriction chain with a fixed trace $t \in \{0,1\}^{M+1}$ and let $\tilde P_t$ denote the restriction chain where the swap moves are suppressed. Then, $\tilde P_t$ consists of independent chains

2 L Iym'Ip(0 

r(z,z') _ 

min(p(z),p(z')) 


A = max_ < 

(z,z')€£(P) 
on each of the $M+1$ distributions. It was shown in [27] that in the case of zero external field, each of the swap-suppressed chains is rapidly mixing, by using Lemma 5.5, the decomposition theorem and comparison of the restrictions with a simple exclusion process on the complete graph. The same analysis holds in our case, the main difference being the non-zero external field. Since the trace is fixed for each of the chains, however, the only fact that must be checked is the analog of Lemma 14 of [27], which says that the distributions at each temperature are unimodal on either side of $\ell_{\min}$. We omit the calculations as they are straightforward to check. Thus, we have that the swap-suppressed chain is polynomially mixing for each $t \in \{0,1\}^{M+1}$.

Lemma 6.6. For each trace $t \in \{0,1\}^{M+1}$, the restriction chain $P_t$ mixes in polynomial time.

Proof. As in Lemma 5.6, the strategy is to use the comparison theorem and compare to the swap-suppressed chain $\tilde P_t$. However, the state space at each temperature is no longer an interval and thus there are some additional steps. We will show that the 2-step chain $Q_t = P_t^2$ is polynomially mixing. This implies polynomial mixing for $P_t$. To show that $Q_t$ mixes polynomially, we use decomposition. Let $Q_{t,\sigma^0,\dots,\sigma^M}$ denote the restriction of the chain where at each temperature, not only is the trace fixed to $t$, but the spin configuration at temperature $i$ is in the set $\Omega_{\sigma^i}$. The projection chain $\bar Q_t$ thus moves on the sets $\Omega_{\sigma^i}$ at each temperature $i$.

Note that the restriction chains $Q_{t,\sigma^0,\dots,\sigma^M}$ are exactly the same as the restriction chains for $(\tilde P_t)^2$ when the restrictions fix the spin configuration at each temperature. The rapid mixing of this chain follows by the arguments in [27, Section 7].

Thus, we are reduced to showing that $\bar Q_t$ is polynomially mixing, and we do this by comparing to $\tilde Q_t$, the projection on total spins of the two-step chain when swap moves are suppressed. This can be done along the lines of the comparison proof in Lemma 5.6.

Let $(x,x')$ be a transition of $\bar Q_t$ with $x = (\sigma_0, \dots, \sigma_M)$ and $x' = (\sigma'_0, \dots, \sigma'_M)$. If $(x,x')$ is a level move which updates the state at a fixed temperature, let the corresponding canonical path in $\tilde Q_t$ be the edge $(x,x')$ itself. On the other hand, suppose $(x,x')$ is a temperature move so that for some $0 \le i < M$, $\sigma_j = \sigma'_j$ for all $j \notin \{i, i+1\}$, $\sigma_i = \sigma'_{i+1}$ and $\sigma_{i+1} = \sigma'_i$. In this case, we divide the construction of the path into several cases based on the values of $\sigma_i$ and $\sigma_{i+1}$. Note that since the trace remains fixed, it must be the case that $t_i = t_{i+1}$. Without loss of generality, we assume that $t_i = 1$, since the calculation in the other case is exactly the same.

1) In the first case, the states satisfy $\sigma_i, \sigma_{i+1} \ge \sigma^1_{\max}$ or $\sigma_i, \sigma_{i+1} \le \sigma^1_{\max}$, that is, they are on the same side of the state $\sigma^1_{\max}$. In these cases, the construction of the canonical path is analogous to the construction in Lemma 5.6.

2) In the second case, $\sigma_i < \sigma^1_{\max}$ and $\sigma_{i+1} \ge \sigma^1_{\max}$, or $\sigma_i \ge \sigma^1_{\max}$ and $\sigma_{i+1} < \sigma^1_{\max}$. The path consists of the concatenation of four paths. First, we move with level moves at the temperature $i$ from $\sigma_i$ to $\sigma^1_{\max}$. Next, we move at temperature $i+1$ with level moves from $\sigma_{i+1}$ to $\sigma^1_{\max}$. So far, the stationary weights of states along the path are non-decreasing. The next part consists of two non-increasing paths. First, we move at temperature $i$ from $\sigma^1_{\max}$ to $\sigma_{i+1}$. Last, we move at the temperature $i+1$ from $\sigma^1_{\max}$ to $\sigma_i$.

It can be verified that since the stationary measure along the paths is unimodal, the length of any path is polynomial and there are at most polynomially many paths using any transition, the comparison constant $A$ can be bounded above by a polynomial.

□

Since the projection chain $\bar P$ and each of the restrictions $P_t$ mixes in time that is polynomial in $n$ and $M$, Theorem 6.1 follows.


7. Torpid mixing of Simulated Tempering 

In this section we show that for the mean-field 3-state ferromagnetic Potts model, there is a 
critical temperature so that for any distribution parametrized by temperature, the mixing time of 
the tempering and swapping algorithms is exponential. 

Theorem 2. Let $\beta_c$ denote the critical inverse temperature of Lemma 7.5. There is a constant $c_1 > 0$ such that for any set of inverse temperatures $\beta_c = \beta_M \ge \cdots \ge \beta_0 \ge 0$ such that $M = n^{O(1)}$, the tempering and swapping chains with the distributions $\pi_{\beta_i}$ for the 3-state mean-field ferromagnetic Potts model have mixing time $\tau(\varepsilon) \ge e^{c_1 n} \ln(1/\varepsilon)$.


We prove the lower bound on the mixing time of the tempering chain by bounding the conductance. The slow mixing of the swapping algorithm then follows by Zheng's result [39] showing that polynomial mixing of the swapping chain implies polynomial mixing of the simulated tempering chain with the same distributions. The conductance is an isoperimetric quantity related to the spectral gap through Cheeger's inequality, a version of which was shown independently by Jerrum and Sinclair [20] and Lawler and Sokal [21]. It often gives an easier method for bounding the mixing time than directly bounding the spectral gap. For $S \subseteq \Omega$, let

$$\Phi_S = \frac{F_S}{C_S} = \frac{\sum_{x \in S,\, y \notin S} \pi(x) P(x,y)}{\pi(S)}.$$

Then, the conductance is given by

$$\Phi = \min_{S : \pi(S) \le 1/2} \Phi_S$$

and it bounds the mixing time both from above and below. Cheeger's inequality implies the following bounds on the mixing time. Let $\pi_{\min} = \min_x \pi(x)$.
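For a very small chain the conductance can be computed by brute force directly from the definition. The sketch below is illustrative only: it assumes a Metropolis chain on a path of five states with a bimodal weight vector chosen for the example, and enumerates all sets $S$ with $\pi(S) \le 1/2$ using exact rational arithmetic.

```python
from fractions import Fraction
from itertools import combinations


def metropolis_chain(w):
    """Metropolis chain on {0, ..., k-1}: propose a uniform +/-1 step and
    accept with probability min(1, w[y] / w[x]); otherwise stay put."""
    k = len(w)
    P = [[Fraction(0)] * k for _ in range(k)]
    for x in range(k):
        for y in (x - 1, x + 1):
            if 0 <= y < k:
                P[x][y] = Fraction(1, 2) * min(Fraction(1), w[y] / w[x])
        P[x][x] = 1 - sum(P[x])          # holding probability
    return P


def conductance(weights):
    """Return (Phi, S) with Phi = min over S, pi(S) <= 1/2, of F_S / pi(S)."""
    w = [Fraction(v) for v in weights]
    total = sum(w)
    pi = [v / total for v in w]
    P = metropolis_chain(w)
    k = len(w)
    best = None
    for size in range(1, k):
        for S in combinations(range(k), size):
            mass = sum(pi[x] for x in S)
            if mass > Fraction(1, 2):
                continue
            flow = sum(pi[x] * P[x][y]
                       for x in S for y in range(k) if y not in S)
            phi = flow / mass
            if best is None or phi < best[0]:
                best = (phi, S)
    return best


phi, cut = conductance([4, 2, 1, 2, 4])
print(phi, cut)
```

The minimizing cut separates one mode from the rest of the state space through the low-weight middle state, which is the same mechanism the torpid mixing argument exploits on a much larger scale.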


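Theorem 7.1. For any reversible Markov chain with conductance $\Phi$,

$$\frac{1 - 2\Phi}{2\Phi} \log\left(\frac{1}{2\varepsilon}\right) \le \tau(\varepsilon) \le \frac{2}{\Phi^2}\left(\log\left(\frac{1}{\varepsilon}\right) + \frac{1}{2}\log\left(\frac{1 - \pi_{\min}}{\pi_{\min}}\right)\right).$$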


The state space of the tempering chain is $\Omega \times [M+1]$, where $\Omega$ consists of spin configurations on the complete graph with three types of spins. To show torpid mixing, it is enough to exhibit a cut in the state space whose conductance is small. For convenience, let us call the 3 spins red, blue and green. The cut we construct depends only on the number of red, blue and green vertices in the configuration. Hence, for the purpose of defining the cut, it is convenient to divide the state space of configurations $\Omega$ into equivalence classes of colorings according to the number of vertices of each color. Furthermore, the cut we define will induce the same cut on $\Omega$ at each temperature.

It is convenient for the exposition to make the following reparametrization using the fact that for the mean-field Potts model, the underlying graph is complete. Let $H(x) = \sigma_1^2 + \sigma_2^2 + \sigma_3^2$, rescale the inverse temperature to $\beta/2$ (which we again denote by $\beta$), and let $Z(\beta)$ denote the corresponding partition function. It can be verified that the Gibbs distribution at inverse temperature $\beta$ can be written as

$$\pi_\beta(x) = \frac{e^{\beta(\sigma_1^2 + \sigma_2^2 + \sigma_3^2)}}{Z(\beta)}.$$

To define the cut, we partition $\Omega$ into sets $\Omega_\sigma$, where $\sigma = (\sigma_1, \sigma_2, \sigma_3)$ is a partition of $n$ and $\Omega_\sigma$ contains all colorings with $\sigma_1$, $\sigma_2$ and $\sigma_3$ vertices colored red, green and blue, respectively. It is helpful to think of the $\sigma$ as points on a simplex. The set $\Omega_\sigma$ corresponds to $\binom{n}{\sigma_1, \sigma_2, \sigma_3}$ different configurations in $\Omega$ and hence we write

$$\pi_\beta(\Omega_\sigma) = \binom{n}{\sigma_1, \sigma_2, \sigma_3} \frac{e^{\beta(\sigma_1^2 + \sigma_2^2 + \sigma_3^2)}}{Z(\beta)}. \qquad (4)$$


The idea for defining the cut with small conductance comes from the following properties of the stationary distribution conditioned on the sets $\Omega_\sigma$. There is a critical temperature $\beta_c$ where the Gibbs distribution exhibits the coexistence of two modes. There is a “disordered” mode in the distribution at $(\frac{n}{3}, \frac{n}{3}, \frac{n}{3})$; this mode is present because though these configurations have small energy, the number of configurations (given by the multinomial term in Equation (4)) is large. At $\beta_c$, there are also “ordered” modes at $(\frac{2n}{3}, \frac{n}{6}, \frac{n}{6})$, $(\frac{n}{6}, \frac{2n}{3}, \frac{n}{6})$ and $(\frac{n}{6}, \frac{n}{6}, \frac{2n}{3})$. These modes are present because configurations with a predominant number of vertices having the same color (red, green or blue) are favored in the Gibbs distribution, though there are not as many of these configurations. The ordered and disordered modes are separated by a region whose density is exponentially smaller than both the modes, where neither the multinomial nor the energy term dominates. As the inverse temperature is decreased below $\beta_c$, the size of the disordered mode grows while the sizes of the ordered modes decrease. However, the region of exponentially small density remains small at every temperature. The cut in the state space of the simulated tempering chain at $\beta_c$ is to take a region surrounding the ordered mode at each temperature. The conductance of this cut, up to a polynomial (in $M$), is bounded by the conductance at the critical temperature where the modes coexist. This is because in the stationary distribution, each temperature is equally likely. In contrast, for the Ising model, there is no temperature at which the ordered and disordered modes coexist. We first present a straightforward upper bound on the conductance of the tempering chain at $\beta_c$.
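The coexistence of modes can be checked numerically from the class weights of equation (4). The sketch below assumes the value $\beta_c = 2\ln 2 / n$ stated in Lemma 7.5 and uses `math.lgamma` to evaluate the multinomial coefficients; the factor $1/Z(\beta)$ cancels in all comparisons. The choice $n = 12000$ is illustrative and only serves to make all class sizes integral.

```python
from math import lgamma, log


def log_class_weight(n, sigma, beta):
    """log of multinomial(n; sigma) * exp(beta * sum_c sigma_c^2), i.e. the
    weight of Omega_sigma in equation (4) without the factor 1 / Z(beta)."""
    log_multinomial = lgamma(n + 1) - sum(lgamma(s + 1) for s in sigma)
    return log_multinomial + beta * sum(s * s for s in sigma)


n = 12000                       # divisible by 12, so every class is integral
beta_c = 2 * log(2) / n
disordered = log_class_weight(n, (n // 3, n // 3, n // 3), beta_c)
ordered = log_class_weight(n, (2 * n // 3, n // 6, n // 6), beta_c)
boundary = log_class_weight(n, (n // 2, n // 4, n // 4), beta_c)

print(ordered - disordered)     # O(1): the two modes have comparable weight
print(boundary - disordered)    # negative and linear in n: the separating region
```

To leading order the exponents of the ordered and disordered classes agree at this $\beta_c$, while the boundary class $(n/2, n/4, n/4)$ is smaller by a factor of roughly $(2^{19/12}/3)^n$. The base is very close to 1, so the exponential separation only becomes visible for large $n$.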


Theorem 7.2. Let $\beta_c$ denote the critical inverse temperature of Lemma 7.5. There is a constant $c_4 > 0$ such that the conductance $\Phi$ of the simulated tempering chain with distributions $\pi_{\beta_i}$, for any $\beta_c = \beta_M \ge \cdots \ge \beta_0 \ge 0$, for the 3-state mean-field ferromagnetic Potts model is at most $e^{-c_4 n}$.


The lower bound on mixing time by the inverse of conductance in Theorem 7.1 implies Theorem 2. In the next section, we will refine this bound in order to compare it to the upper bound on the mixing time of the Metropolis chain at a fixed temperature to show Theorem 3.

Let $A \subseteq \Omega$ be the set of configurations such that $\sigma_1, \sigma_2, \sigma_3 \le n/2$. Let $P_{st}$ denote the transition matrix of the simulated tempering chain. Let $S = \{(x,i) \mid x \in A,\ \beta_0 \le \beta_i \le \beta_c\}$. Let

$$B = \{x \in A \mid \exists\, x' \in \Omega \setminus A,\ P_{st}((x,i),(x',i)) > 0\ \ \forall\ 0 \le i \le M\}$$

be the boundary of $A$ (the set of configurations with at least one of $\sigma_1$, $\sigma_2$ or $\sigma_3$ equal to $n/2$). Our aim is to show that the conductance $\Phi_S$ of the set $S$ is exponentially small. Note that it is not true that $\pi(S) \le 1/2$ and hence this does not immediately imply a bound on $\Phi$. Instead, we will show that the coexistence of the ordered and disordered phases still yields an exponentially small bound on $\Phi$. We start by bounding $\Phi_S$.


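$$\Phi_S = \frac{F_S}{C_S} = \frac{\sum_{i=0}^{M} \sum_{x \in B} \sum_{x' \notin A} \pi_{st}(x,i)\, P_{st}((x,i),(x',i))}{\sum_{i=0}^{M} \sum_{x \in A} \pi_{st}(x,i)} \le \frac{\sum_{i=0}^{M} \sum_{x \in B} \pi_{\beta_i}(x)}{\sum_{i=0}^{M} \sum_{x \in A} \pi_{\beta_i}(x)}. \qquad (5)$$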


The last expression above is the ratio of the sum over temperatures of the stationary probabil¬ 
ities of configurations in the set B (the boundary of the set A) to the sum over temperatures of 
the stationary probabilities of the configurations in the set A. In order to bound this quantity, we 
will need several technical lemmas which we state in the course of the proof but prove later to 
maintain the flow of the argument. The proofs of these lemmas are gathered in Section 7.1.

For $0 < a < 1$ let $\Omega_{an}$ denote the set of configurations where $\sigma_1 = an$ and $\sigma_2 = \sigma_3 = (1 - a)n/2$. In the next step, we show that by losing only a polynomial factor, the numerator of (5) can be bounded by the sums of the probabilities of the configurations $\Omega_{n/2}$ (the set of configurations on the boundary $B$ with equal numbers of green and blue vertices), while the denominator is certainly as large as the weight of the configurations in $\Omega_{n/3}$ (the set of configurations with equal numbers of red, blue and green vertices). In particular, we want to show that for some constant $c$,


LL ^p.w 


LL ^p.w 

zG/xGA 


< 


Cn= ---- 

i€l 


( 6 ) 


We use the following lemma, which says that in the simplex, along the line where the number 
of red vertices is n/2, the distribution at every temperature has a unique maximum at the config¬ 
urations where the number of green vertices is equal to the number of blue vertices. We define 
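the following function $\pi_i(x)$, which interpolates the discrete density $\pi_{\beta_i}$ continuously:

$$\pi_i(x) = \frac{\Gamma(n)}{\Gamma(\frac{n}{2})\,\Gamma(xn)\,\Gamma((\frac{1}{2} - x)n)} \cdot \frac{e^{\beta_i n^2\left((\frac{1}{2})^2 + x^2 + (\frac{1}{2} - x)^2\right)}}{Z(\beta_i)}, \qquad x \in (0, 1/2).$$

Lemma 7.3. The function $\pi_i(x)$ has a unique maximum such that $xn$ is integer in the range $0 < x < \frac{1}{2}$ and attains its maximum at $x = \frac{1}{4}$ for all $i$ such that $\beta_i \le \beta_c$.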
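The claim of Lemma 7.3 can be checked by direct enumeration for a moderate $n$. The sketch below is a sanity check rather than a proof: it assumes $\beta_c = 2\ln 2/n$ as in Lemma 7.5, uses the exact multinomial weight of equation (4) instead of the Gamma-function interpolation, and scans all classes with $n/2$ red vertices. The function names and the value $n = 400$ are illustrative.

```python
from math import lgamma, log


def boundary_log_weight(n, green, beta):
    """Unnormalized log-weight of the class (n/2, green, n/2 - green)."""
    sigma = (n // 2, green, n // 2 - green)
    log_multinomial = lgamma(n + 1) - sum(lgamma(s + 1) for s in sigma)
    return log_multinomial + beta * sum(s * s for s in sigma)


def boundary_argmax(n, beta):
    """Number of green vertices maximizing the weight along sigma_1 = n/2."""
    return max(range(n // 2 + 1),
               key=lambda g: boundary_log_weight(n, g, beta))


n = 400
beta_c = 2 * log(2) / n
for fraction in (0.25, 0.5, 1.0):
    print(fraction, boundary_argmax(n, fraction * beta_c))
```

For each of the tested inverse temperatures at or below $\beta_c$ the maximum sits at $n/4$ green vertices, that is, at equal numbers of green and blue vertices.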

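The proof appears in Section 7.1. This implies the inequality (6). Next, we will show that $\Phi_S$ is essentially determined by the conductance of the cut induced at the highest inverse temperature $\beta_M$.

Lemma 7.4. For every inverse temperature $\beta_i \le \beta_c$,

$$\frac{\pi_{\beta_i}(\Omega_{n/2})}{\pi_{\beta_i}(\Omega_{n/3})} \le \frac{\pi_{\beta_c}(\Omega_{n/2})}{\pi_{\beta_c}(\Omega_{n/3})}.$$

Proof. Note that only the exponential term in (4) varies with $\beta_i$. Letting $h(n)$ be the ratio of the multinomial terms, we have

$$\frac{\pi_{\beta_i}(\Omega_{n/2})}{\pi_{\beta_i}(\Omega_{n/3})} = h(n)\, e^{\beta_i (3n^2/8 - n^2/3)} = h(n)\, e^{\beta_i n^2/24} \le h(n)\, e^{\beta_c n^2/24} = \frac{\pi_{\beta_c}(\Omega_{n/2})}{\pi_{\beta_c}(\Omega_{n/3})}.$$

□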


This implies that the ratio on the right hand side of (6) can be bounded as follows


I 

iei 

I 


^P/ if^n/l) 
^P,(^«/3) 


Tip (fl„/2) 
<C6n . 

^Pc(^n/3) 


(V) 


for some constant $c_6 > 0$. There are two final steps to bounding the conductance. First, we will show that the ratio $\pi_{\beta_c}(\Omega_{n/2}) / \pi_{\beta_c}(\Omega_{n/3})$ is exponentially small. Second, we will use this to bound $\Phi$. These facts follow from properties of the stationary distribution proved in Lemmas 7.5 and 7.6.

The following lemma demonstrates that there is a critical temperature at which $\Omega_{n/3}$ and $\Omega_{2n/3}$ both have large weight compared to $\Omega_{n/2}$. Also, the configurations $\Omega_{n/3}$ have a weight that is at least a polynomial fraction of the stationary weight of $\Omega$ at $\beta_c$.


Lemma 7.5. At $\beta_c = \frac{2\ln 2}{n}$,


(i) 7I[3^(fl„/3) — 7tp^(fl2„/3)+o(l). 

(ii) 

7Cpjn„/3) — 

(in) Tip^(a„/ 3 ) > ^ 


The proof of the lemma can be found in Section 7.1. Putting together the bound on $\Phi_S$ from inequality (7) and part (ii) of Lemma 7.5, we obtain that for some constant $c_7 > 0$,

<^s < 

Lastly, we show the bound on the conductance $\Phi$. We need the following lemma, which says that the stationary weights of the configurations on either side of the cut $S$ are within a polynomial factor of each other.


Lemma 7.6. The stationary weight in the tempering chain of the set S is bounded as Kst{S) < 
Proof.




> 


fBv Lemma l7.5l l'/)) > 
fBv Lemma 17. 5 1 1'///)) > 


> 


ieixea\A 


1 1 
4^M+1 


7I,f(5) 


where the last inequality follows since $\pi_{\beta_i}(\Omega) = 1$.


□ 


With this lemma in hand, we can bound the conductance of the tempering Markov chain at 
the temperature 



_ Fs^ 


C 5 C 

(By Lemma IT6|1 

< 

C 5 

(By reversibility) 

= 

Cs 




This bounds the conductance since $\Phi \le \max(\Phi_S, \Phi_{\bar S}) \le e^{-c_4 n}$ for some $c_4 > 0$. This completes the proof of Theorem 7.2, which implies that simulated tempering is torpidly mixing. Zheng [39] has shown that polynomial mixing of the swapping Markov chain implies polynomial mixing of the tempering chain. Thus the torpid mixing of simulated tempering implies that the swapping chain for the mean-field Potts model is also torpidly mixing for the same distributions, completing the proof of Theorem 2.


7.1. Proofs of Technical Lemmas. We now present the proofs of the technical lemmas about the stationary distribution which were used in the previous section.


Lemma 7.3. The function $\pi_i(x)$ has a unique maximum such that $xn$ is integer in the range $0 < x < \frac{1}{2}$ and attains its maximum at $x = \frac{1}{4}$ for all $i$ such that $\beta_i \le \beta_c$.

Proof. Recall that we have defined 

T{n) 


71; (x) = 




r(f)r(xn)r((i-x)n) 


zm 


xG (0,1/2). 


Neglecfing the factors that are not dependent on x we can write the function that we would like 
to maximize as 


M ^ 
gW 


gp,n(,G+(i-x)2) 

r(xn)r((i — x) n) 
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and show that the unique maximum is at x = 1/4. To test the sign of the derivative > we 

compare the quantities -j and since both / and g are positive in the interval (0,1 /2). It can be 
verified that 4 = p,n(4x — 1) and 


= n\ - 


^^k + xn - 1 \ I ’ 


where we have used that the derivative of the gamma function is given by 


r'(x)=r(x) -y+£ -- 


k=i 


k k+x—1 


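where $\gamma$ is the Euler-Mascheroni constant. Thus, we verify that there is a stationary point at
$x = 1/4$, since $f'/f = 0 = g'/g$ there. We will verify that this is the unique stationary point in $(0,1/2)$. Using
the integral approximation bound $\sum_{k=cn}^{\infty} \frac{1}{k^2} \ge \frac{1}{cn}$, we have

\[ \left( \frac{f'}{f} \right)' = 4\beta_i n \]

and

\[ \left( \frac{g'}{g} \right)' = n^2 \sum_{k=1}^{\infty} \left( \frac{1}{(k + xn - 1)^2} + \frac{1}{(k + (1/2-x)n - 1)^2} \right) \ge n^2 \left( \frac{1}{xn} + \frac{1}{(1/2-x)n} \right) = \frac{n}{2x(1/2-x)}. \]

Therefore,

\[ \left( \frac{g'}{g} \right)' \ge 8n > 4\beta_c n > 4\beta_i n = \left( \frac{f'}{f} \right)'. \]

Since the derivative of $g'/g$ is greater than the derivative of $f'/f$ at each point
in $(1/4,1/2)$, there are no stationary points in that interval. A similar argument by symmetry
shows there are no stationary points in $(0,1/4)$. □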
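
As a numerical sanity check of Lemma 7.3, the following Python sketch evaluates $\log(f/g)$ at the integer points $k = xn$ and locates the maximizer. It assumes the exponent has the form $\beta n (x^2 + (1/2-x)^2)$, as the proof reads after extraction; the values of `n` and `beta` are hypothetical and chosen only for illustration.

```python
import math

def log_ratio(k, n, beta):
    """log(f/g) at x = k/n, assuming f = exp(beta*n*(x^2 + (1/2-x)^2))
    and g = Gamma(xn) * Gamma((1/2-x)n)."""
    x = k / n
    energy = beta * n * (x * x + (0.5 - x) ** 2)
    return energy - math.lgamma(k) - math.lgamma(n / 2 - k)

def argmax_k(n, beta):
    """Integer k = xn with 0 < k < n/2 that maximizes f/g."""
    return max(range(1, n // 2), key=lambda k: log_ratio(k, n, beta))

if __name__ == "__main__":
    n = 400  # hypothetical system size, divisible by 4
    for beta in (1.0, 1.5):  # hypothetical inverse temperatures
        print(beta, argmax_k(n, beta), n // 4)
```

In this sketch the maximizer sits at $k = n/4$, matching the stationary point $x = 1/4$ identified in the proof.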


Lemma 1731 At pc = 

(i) 7I[3^(fl„/3) = 7tp^(fl2„/3)+o(l). 

(ii) 

7Cpjn„/3) — 

(hi) %(a„/3) > 

Proof, (i) We solve for pc. Let Tip.(fl„/ 3 ) = tip.(fl 2 „/ 3 ). Then, 


n \ 

2n n n 
3 ’ 6’ 


md 


n n n 
3’ 3’ 3 


gP,("V3) 
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This implies 




(fO (10 (iO 


m (fo (to 

2n n 

:iva) 


which occurs when 


P/ = 


25 

21 n(2) 


(l + 0 (n-i)), 

ln(l + 6>(?i^^)) . 


+ 


y/ln?- 


Setting Pc to gives the desired result. 


(ii) Let Pc = ■ Then we have 


ttp^(^n/3) 


( " "I 

1 n « £ I 

'• 2 ’ 4 ’ 4 


,Pc(3«V8) 


^ 3 ’ 3 ’ 3 ^ 

v3 


(fO^ 


(10 m 

(T1 / 8 ^ 2 


p,nV24 


111 -iin 


pln(2)«/12(l + 0(„-l)) 

(^) (l + (9(n-i)) 


(hi) Let pc = ■ Consider any general point in the simplex of the form (x,y, 1 — x —y) for 

$0 < x + y < 1$. It can be verified that the function

\[ h(x,y) = \frac{f(x,y)}{g(x,y)} = \frac{e^{\beta_c n (x^2 + y^2 + (1-x-y)^2)}}{\Gamma(xn)\,\Gamma(yn)\,\Gamma((1-x-y)n)} \]

has a global maximum at $(1/3,1/3)$, i.e. $h(x,y) \le h(1/3,1/3)$ for all $x,y$ such that $0 < x+y < 1$.
This can be shown by checking that $h$ is maximized at $(1/3,1/3)$ over all stationary points of

h{x,y). This implies that Ttp^ (n„/ 3 ) > □ 
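
The calculation in part (i), which balances the weights of the disordered class $(n/3, n/3, n/3)$ and the ordered class $(2n/3, n/6, n/6)$, can be illustrated numerically. The sketch below is not the authors' derivation: it assumes that the weight of a class $\sigma$ is the multinomial coefficient times $\exp(\beta \sum_j \sigma_j^2 / n)$, which is the form the exponent of $h(x,y)$ takes after extraction. The function names are hypothetical.

```python
import math

def log_multinomial(n, parts):
    """log of n! / (parts[0]! * parts[1]! * ...)."""
    return math.lgamma(n + 1) - sum(math.lgamma(p + 1) for p in parts)

def coexistence_beta(n):
    """beta at which (n/3, n/3, n/3) and (2n/3, n/6, n/6) get equal weight,
    ASSUMING weight(sigma) = multinomial(n; sigma) * exp(beta * sum(sigma_j^2) / n).
    n should be divisible by 6."""
    disordered = (n // 3, n // 3, n // 3)
    ordered = (2 * n // 3, n // 6, n // 6)
    energy_gap = (sum(s * s for s in ordered) - sum(s * s for s in disordered)) / n
    entropy_gap = log_multinomial(n, disordered) - log_multinomial(n, ordered)
    return entropy_gap / energy_gap

if __name__ == "__main__":
    for n in (600, 6000, 60000):
        print(n, coexistence_beta(n), 2 * math.log(2))
```

Under this assumed normalization, the balancing value approaches $2\ln 2$ as $n$ grows, with a correction of order $1/n$.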


7.2. Polynomial Mixing of Flat-tempering the 3-state Potts Model. The above proof of slow
mixing due to a first-order phase transition, together with the results of Section 5, gives the
insight that the tempered distributions should be defined so that the first-order discontinuity is not
preserved. We show that the Flat-tempering algorithm with distributions $\rho_i$ defined below can
be used to efficiently sample from configurations of the 3-state ferromagnetic mean-field Potts
model at any temperature. The function $f$ in this case is

With $\rho_i$ defined as above,

pi(^o) = C (8) 

\a 10203 J Zi 

Theorem 7.7. Let P = ^ for a constant fj > 0. Then, for some constant cs > 0 the simulated 
tempering Markov chain P with the distributions po,... ,Pm mixes in time 0{n‘^^). 


Proof. The proof makes use of the decomposition theorem. The strategy is to partition the state
space of the tempering chain $\Omega_{st}$ into the sets $(\Omega_\sigma, i)$ for each equivalence class of configurations
$\sigma$ and inverse temperature $\beta_i$. To keep the notation simple, we write the restriction sets as
$(\sigma, i)$. The restriction sets $(\sigma, i)$ are not connected by the chain $P$, since it only moves between
configurations which differ in the spin at exactly one vertex. We can get around this technicality
by first bounding the mixing time of the 2-step chain $P^2$.

Restricted to a set $(\Omega_\sigma, i)$, the chain $P^2$ mixes in polynomial time, which can be seen by comparison with the chain
on $(\Omega_\sigma, i)$ in which, at each step, the spins at two randomly chosen vertices are exchanged. This
follows since the mixing time of this chain is only smaller than the mixing time of the interchange
process on the complete graph, which is bounded by $O(n \ln n)$ for the complete graph on
$n$ vertices (see e.g. [1, Chapter 14]).

We analyze the projection by comparison to the complete graph on the states of the projection
$\{(\sigma, i)\}$. For every pair of states $(\sigma, i)$ and $(\sigma', j)$, we define a path using edges of the projection chain and show
that the congestion of these paths is at most a polynomial.

Assume without loss of generality that $i \le j$. Let $x(\sigma,\sigma')$ be a sequence of $O(n)$ states that
is the set of vertices along a shortest path using edges of the projection chain in $(\Omega, 0)$ from
$\Omega_\sigma$ to $\Omega_{\sigma'}$, not including the endpoints. The path between $(\sigma, i)$ and $(\sigma', j)$ is defined to be the
concatenation of the paths $((\sigma,i), (\sigma,i-1), \ldots, (\sigma,0))$, $x(\sigma,\sigma')$, and $((\sigma',0), \ldots, (\sigma',j))$. The
observations we use to bound the congestion of the paths by a polynomial are as follows.

i) Let Gmax be an equivalence class of configurations maximizing pM(f2fj). By ([8]l, for any i, 

PM(^Cmax) — 

ii) For any edge in the kernel of the Markov chain, the number of paths which are routed 
through it is at most 0{n‘^M^) < n^^^f taking into account the possible starting and ending 
states. 
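
The path construction described above (down to level 0, across at level 0, then up) can be sketched in Python. This is an illustration under stated assumptions, not the authors' code: equivalence classes are represented as triples of color counts, a single step at level 0 is assumed to move one vertex from one color to another, and the function names `base_path` and `tempering_path` are hypothetical.

```python
def base_path(sigma, target):
    """Shortest path between two color-count triples with equal totals,
    assuming each step recolors a single vertex."""
    cur = list(sigma)
    path = [tuple(cur)]
    while tuple(cur) != tuple(target):
        src = next(a for a in range(3) if cur[a] > target[a])  # color with a surplus
        dst = next(b for b in range(3) if cur[b] < target[b])  # color with a deficit
        cur[src] -= 1
        cur[dst] += 1
        path.append(tuple(cur))
    return path

def tempering_path(sigma, i, target, j):
    """Concatenate (sigma,i)...(sigma,0), the interior of the level-0 path,
    and (target,0)...(target,j)."""
    sigma, target = tuple(sigma), tuple(target)
    down = [(sigma, t) for t in range(i, -1, -1)]
    across = [(s, 0) for s in base_path(sigma, target)[1:-1]]
    up = [(target, t) for t in range(j + 1)]
    if sigma == target:
        up = up[1:]  # avoid repeating the shared state (sigma, 0)
    return down + across + up

if __name__ == "__main__":
    for state in tempering_path((3, 2, 1), 2, (1, 2, 3), 1):
        print(state)
```

Each path has length at most $i + j + O(n)$, which is the quantity that enters the congestion bound together with the number of paths through an edge.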


Then, the congestion of the paths can be bounded as follows. We divide into two cases. In the first,
an edge corresponds to a change in the temperature and is of the form $(\sigma,i'), (\sigma,i'-1)$
for some $i' \le i$ (or $(\sigma,j'), (\sigma,j'+1)$ for some $j' < j$). In the second, the edge corresponds to a
pair of adjacent states at the inverse temperature $\beta_0$.

• Assume that the edge is of the form $(\sigma,i'), (\sigma,i'-1)$ for some $i' \le i$. By the observations
i) and ii) above,


min 


A < n 


0 ( 1 ) 






Pm (^Omax) Pm (^Omax) 


mm 






pl/“(iia™ax)’pr‘>^“(iia„ax) 


, 0 ( 1 ) 


Pm(^o 


_Pm(^o„, 


I—I 

M 


< h 


0 ( 1 ) 


< 
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The other ease is analogous. 

• Suppose that the edge is a pair of adjaeent states at Po- Sinee for every a, po(f^o) = 
we have 


A 


< 


min 

- 


\ Pm (^Omax ) Pm (^Omax ) / 


< n 


0 ( 1 ) 


Finally, by applying the comparison theorem, the polynomial mixing time of $P^2$ implies that the
mixing time of $P$ is at most a polynomial. This follows since for any two adjacent states of $P$,
the ratio of the stationary probabilities is at least an inverse polynomial. Moreover, for any edge
of the 1-step chain, there are at most a polynomial number of possibilities for the other step. □


8. Tempering Can Slow Down Fixed Temperature Algorithms 

We have shown that simulated tempering can mix torpidly. In fact, tempering can be slower
than the fixed temperature algorithm by more than a polynomial factor. In this section we show
that for the 3-state Potts model, at an inverse temperature $\beta^*$ just above the critical inverse
temperature, on a restricted part of the state space $\Omega$, simulated tempering can be slower than
the fixed temperature Metropolis chain by an exponential factor. The idea is that although the
mixing time of the Metropolis chain at $\beta^*$ is exponential, it is bounded by the size of the cut at
$\beta^*$, while the mixing time of the simulated tempering chain can be an exponential multiplicative
factor worse because the conductance of the same cut at the higher temperatures is much smaller.
Intuitively, on average, the chain spends even less time mixing on both sides of the cut at the
higher temperatures than at $\beta^*$. The precise theorem we show is the following. Let us denote
by $\Omega_{RGB} = \{x \in \Omega : \sigma_1 \ge \sigma_2 \ge \sigma_3\}$ the subset of the state space for the 3-state Potts model
where the number of vertices of the first color dominates the number of the second, which in turn
dominates the number of vertices of the third color.

Theorem 8.1. Let P* = - where fu > 41og(2). Assume that the number of distributions for 
tempering is M = &{n). Then, there are constants 5 > 0 and a < 0 (which may depend on P*j 
such that the simulated tempering algorithm on LIrgb at P* mixes only after time 
The Metropolis algorithm at temperature P* mixes in time 

8.1. Torpid Mixing of Tempering for $\beta^* > \beta_c$. We start by proving the first part of the
theorem above by showing the following bound on the conductance of the simulated tempering
chain. Let $\Phi_{RGB}$ denote the conductance of the tempering chain on $\Omega_{RGB}$ at inverse temperature
$\beta^*$.

Proposition 8.2. Let P* = ^ where p > 4 log (2). Then, there exists a <0 and 5 > 0 such that 
^RGB < e(“-5)«+A«). 

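Define the set $K_{RGB} = \{\sigma = (\sigma_1, \sigma_2, \sigma_3) \text{ where } \sigma_1 \ge \sigma_2 \ge \sigma_3,\ \sum_i \sigma_i = n\}$. Thus $K_{RGB}$ is the
set of $\sigma$ corresponding to the configurations in $\Omega_{RGB}$. For $\sigma \in K_{RGB}$, the Gibbs distribution is
given by

\[ \pi_{\beta_i}(\Omega_\sigma) = \binom{n}{\sigma_1\ \sigma_2\ \sigma_3} \frac{e^{\beta_i H(\sigma)}}{Z_{RGB}(\beta_i)}, \]

where $Z_{RGB}(\beta_i)$ is the normalizing constant.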





[Figure 1: schematics depicting the values of $\pi_\beta(\Omega_\sigma)$ along $\ell_{GB}$ at various values of $\beta$.]

Figure 1. The profile of the probability density function over $K_{RGB}$


Denote by $\ell_{GB}$ the set of equivalence classes of configurations $(\lambda n, (1-\lambda)n/2, (1-\lambda)n/2)$, for
$1/3 \le \lambda \le 1$, i.e., the subset of $K_{RGB}$ with partitions that have an equal number of blue and green
vertices (see Figure 1). There exists a constant $\lambda_{\min}$ (which can be found by differentiating the
appropriate function), a value of $\lambda$ between the ordered and disordered modes where $\pi_{\beta^*}$
is minimized along the line $\ell_{GB}$. Let $\Omega_{\lambda_{\min}}$ be the corresponding set of spin configurations.
Let $\beta_M = \beta^*$, where $\mu > 4\log(2)$ is the constant of Theorem 8.1. Let $A \subset \Omega_{RGB}$ be the set of
configurations $x$ with $x_1 < \lambda_{\min} n$. Let $S = \{(x, i) \mid x \in A,\ \beta_0 \le \beta_i \le \beta_M\}$. Let $B = \{x \in A \mid \exists\, x' \in
\Omega_{RGB} \setminus A,\ P(x,x') > 0\}$ be the boundary of $A$. Then, as in (5), we can bound the conductance of
the set $S$ for the tempering chain as follows.


< 


M 

£Lttp.(x) 

i=0xeB 

~M 

i=0xeA 




(9) 


The second inequality above follows from the fact that the distribution when restricted to B at 
every temperature is unimodal and is maximized at ,. 


Lemma 8.3. Let P* = ^ where q > 4 log(2). For n sufficiently large, the continuous function 
$\pi_{\beta_i}(x) = \pi_{\beta_i}(\lambda_{\min} n, (1 - \lambda_{\min} - x)n, xn)$ has a unique maximum in the range $0 < x < 1 - \lambda_{\min}$ at
$x = (1 - \lambda_{\min})/2$, for all $i \in \{1, \ldots, M\}$.


The proof of the result above appears at the end of this subsection. Rewriting the last expres¬ 
sion in (|9ll, we have 


<I >5 < 0{n) 



( 10 ) 
We use the following properties of the stationary distribution to bound the conductance. The
first fact is that the stationary weight of the disordered mode conditioned on being at a particular
temperature is non-decreasing as we decrease $\beta$.

Lemma 8.4. For $i \in \{1, \ldots, M\}$, we have that for some $C > 1$, $\pi_{\beta_{i-1}}(\Omega_{n/3}) \ge C\, \pi_{\beta_i}(\Omega_{n/3})$.
Proof. We have

^P,--i(^>i/3) _ ZRGR(Pi)/7tp,(f^n/3) _ aeW ^ ^ 

^p,(^«/3) Z/?GB(P;-l)/ttp,_i(^^«/3) y f ” Vp,-i(«(a)-f?(ai/3)) 

^2 ^3/ 

for some C > 1. We obtain the last inequality by arguing as follows. Since Fl{o) is minimized 
at Oi = 02 = CT 3 = 1/3, for each a G Krgb, H{a) > //(ai/ 3 ). In fact, for each a / ai /3 it is 
the case that H{a) > erf for some constant c, and therefore for a / CT 1 / 3 , the ra- 

tios_p^^'/^' > K for some constant K > \. Moreover, since each of the terms 

gp,(77(c)-77(o,/3)) > Y and \Krgb\ = one can find a constant 1 < C < A' such that the in¬ 
equality above holds. □ 


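Next, we observe that the height of the disordered mode increases faster than the height at $\Omega_{\lambda_{\min}}$.

Lemma 8.5. There is a constant $d > 1$ such that

\[ \frac{\pi_{\beta_{i-1}}(\Omega_{n/3})}{\pi_{\beta_i}(\Omega_{n/3})} \ge d \cdot \frac{\pi_{\beta_{i-1}}(\Omega_{\lambda_{\min}})}{\pi_{\beta_i}(\Omega_{\lambda_{\min}})}. \]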


Proof Expanding the terms shows that 

Recall that — P;_j = 0{-^) while —//(ai/ 3 ) = 0.{n^), since Xmin is a constant. The 

claim follows since M = 0 (?i). □ 


By Lemma |83J the rate of increase of terms in the series in the denominator of (4) is at least 
a constant, d > I, times the rate of increase of terms in the series in numerator. Combining with 
Lemma lMl and using the fact that M = &{n), for some constants ^3 > 0 and 1/2 > L (fTOt implies 


<I >5 < 0{n) 




! + (#)+...+ (§ 


di 


^Pm(^«/3) 


1 + t/2 H“ * * * H“ ^2 




< o{n) (min(d 2 , d)). 

^Pm(“«/3) 

Proposition [8]2] follows by setting 5 = |ln(min(if 2 ,d)). 

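Proof of Lemma 8.3. As in the proof of Lemma 7.3, we define

\[ \pi_i(x) = \frac{\Gamma(n)}{\Gamma(\lambda_{\min} n)\,\Gamma(xn)\,\Gamma((1-\lambda_{\min}-x)n)} \cdot \frac{e^{\beta_i n (\lambda_{\min}^2 + x^2 + (1-\lambda_{\min}-x)^2)}}{Z(\beta_i)}, \qquad x \in (0, 1-\lambda_{\min}). \]

Neglecting the factors that do not depend on $x$, we can write the function that we would like
to maximize as

\[ \frac{f(x)}{g(x)} = \frac{e^{\beta_i n (x^2 + (1-\lambda_{\min}-x)^2)}}{\Gamma(xn)\,\Gamma((1-\lambda_{\min}-x)n)} \]

and show that the unique maximum is at $x = (1-\lambda_{\min})/2$. To test the sign of the derivative
$(f/g)'$, we compare the quantities $f'/f$ and $g'/g$, since both $f$ and $g$ are positive in the interval
$(0, 1-\lambda_{\min})$. It can be verified that

\[ \frac{f'}{f} = \beta_i n \left( 4x - 2(1-\lambda_{\min}) \right) \]

and

\[ \frac{g'}{g} = n \sum_{k=1}^{\infty} \left( \frac{1}{k + (1-\lambda_{\min}-x)n - 1} - \frac{1}{k + xn - 1} \right). \]

Thus, there is a stationary point at $x = (1-\lambda_{\min})/2$, since $f'/f = 0 = g'/g$ there. We will argue that this is the
unique stationary point in $(0, 1-\lambda_{\min})$. Using the integral approximation bound $\sum_{k=cn}^{\infty} \frac{1}{k^2} \ge \frac{1}{cn}$,
we have

\[ \left( \frac{f'}{f} \right)' = 4\beta_i n, \]

and

\[ \left( \frac{g'}{g} \right)' = n^2 \sum_{k=1}^{\infty} \left( \frac{1}{(k + xn - 1)^2} + \frac{1}{(k + (1-\lambda_{\min}-x)n - 1)^2} \right) \ge n^2 \left( \frac{1}{xn} + \frac{1}{(1-\lambda_{\min}-x)n} \right) = \frac{n(1-\lambda_{\min})}{x(1-\lambda_{\min}-x)}. \]

Therefore, for $n$ large enough,

\[ \left( \frac{g'}{g} \right)' \ge \frac{4n}{1-\lambda_{\min}} > 4\beta^* n \ge 4\beta_i n = \left( \frac{f'}{f} \right)'. \]

Since the derivative of $g'/g$ is greater than the derivative of $f'/f$ at each point
in $((1-\lambda_{\min})/2, 1-\lambda_{\min})$, there are no stationary points in that interval. A similar argument
by symmetry shows there are no stationary points in $(0, (1-\lambda_{\min})/2)$. □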


8.2. Upper bound for the Metropolis Algorithm on $\Omega_{RGB}$. The Metropolis Markov chain on
$\Omega$ is known to have exponential mixing time, and the same argument also holds on $\Omega_{RGB}$. We
would now like to derive a good upper bound on this mixing time so that we can compare it to
the bound obtained for the tempering chain. However, bounding the conductance and applying
Theorem 7.1 will not be sufficient, as the square of the conductance gives too weak a bound.
Instead, to obtain the best possible lower bound on the spectral gap of the Metropolis chain, we
appeal to the comparison theorem [10]. We use this technique to obtain a tight exponential upper
bound for the mixing time. Let $P$ be the Metropolis chain on $\Omega_{RGB}$ with stationary distribution
$\pi = \pi_{\beta^*}$. Then, the second part of Theorem 8.1 is as follows.

Proposition 8.6. Let P* = ^ where q > 41og(2) and let a = In (tip* )/tip* (f2i/3)) < 0. The 
Markov chain P mixes in time 
The idea behind showing the mixing time claimed in Proposition 8.6 is to define a new
distribution $\tilde\pi$ on $\Omega_{RGB}$ by effectively eliminating the disordered mode. The Metropolis chain $\tilde P$ is
defined on $\Omega_{RGB}$ with stationary distribution $\tilde\pi$. We will show that the mixing time of $\tilde P$ is at
most a polynomial. The comparison theorem then gives the required upper bound on the mixing
time of the Metropolis chain $P$ in Proposition 8.6. Let

^ ■— {tJ £ ^RGB . CTi < and 7tp*(fl(j) ^ Tip* 

For a E Krgb define 


r Tip.(a,,„j(rvz if oE^ 

^ 1 7ip*(aa)(p*)/Z olherwise, 

where 

Z = £ Tip* ) + £ Tip* (flo) 

0€K aS:ClitGB\K 

is fhe normalizing parfifion function. 

For a configuration $x \in \Omega_{RGB}$, we define $\tilde\pi(x)$ to be uniform over all the configurations in the
same equivalence class, i.e., if $x$ is in the equivalence class $\sigma$,

\[ \tilde\pi(x) = \binom{n}{\sigma_1\ \sigma_2\ \sigma_3}^{-1} \tilde\pi(\Omega_\sigma). \]

The first step is to show that $\tilde P$, the Metropolis chain on the flattened distribution, mixes in
polynomial time. This will follow from an application of the decomposition theorem [29] (see
below). The second step will be to use this bound and the comparison theorem to bound the
mixing time of the chain on the original unflattened space. The mixing time of $\tilde P$ will be a
lower order term when we compare it to the mixing time of $P$, which is exponential. Thus, any
polynomial bound on the mixing rate of $\tilde P$ will suffice.
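
The flattening step can be illustrated on a one-dimensional slice of the state space. The sketch below is a simplified illustration, not the authors' construction: it restricts attention to classes of the form $(\lambda n, (1-\lambda)n/2, (1-\lambda)n/2)$, assumes a per-vertex exponent `phi` with energy term $\frac{\beta}{2}(3\lambda^2 - 2\lambda + 1)$ (a hypothetical normalization), and uses hypothetical values $n = 300$ and $\beta = 1.42$. Weights to the left of the minimum between the two modes are capped at the value at that minimum.

```python
import math

def phi(lam, beta):
    """Assumed per-vertex exponent of the weight of (lam, (1-lam)/2, (1-lam)/2)."""
    energy = 0.5 * beta * (3 * lam * lam - 2 * lam + 1)
    entropy = -lam * math.log(lam) - (1 - lam) * math.log((1 - lam) / 2)
    return energy + entropy

def line_weights(n, beta, steps=2000):
    """Weights exp(n * phi) on a grid of lam in [1/3, 0.99], scaled so the maximum is 1."""
    grid = [1 / 3 + k * (0.99 - 1 / 3) / steps for k in range(steps + 1)]
    top = max(phi(lam, beta) for lam in grid)
    return [math.exp(n * (phi(lam, beta) - top)) for lam in grid]

def flatten(weights):
    """Cap every weight left of the inter-mode minimum at the value at that minimum."""
    i_max = max(range(len(weights)), key=weights.__getitem__)
    cut = min(range(i_max + 1), key=weights.__getitem__)
    flat = [min(w, weights[cut]) if i < cut else w for i, w in enumerate(weights)]
    return flat, cut, i_max

if __name__ == "__main__":
    w = line_weights(300, 1.42)
    flat, cut, i_max = flatten(w)
    print("ratio of flattened to original total weight:", sum(flat) / sum(w))
```

In this toy setting the flattened weights are unimodal, and their total differs from the original total by a modest factor, which mirrors the role the two partition functions play in the comparison argument.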

Theorem 8.7. The Markov chain $\tilde P$ with stationary distribution $\tilde\pi$ mixes in polynomial time.


To apply the decomposition theorem here, we partition the space $\Omega_{RGB}$ according to the
equivalence classes of configurations, i.e. into the space $K_{RGB}$. It will be simpler to bound
the mixing time of $Q = \tilde P^2$, the two-step transition matrix that allows moves of length 0, 1 or
2. We can then infer the polynomial mixing of $\tilde P$ from the polynomial mixing of $Q$. It is easy
to see that $Q$ is polynomially mixing when restricted to $\Omega_\sigma$, for any $\sigma$, because two-step moves
permute the colors on the vertices without changing the total number of each, and the mixing
time can be bounded by that of an interchange process [1, Chapter 14]. Hence, we focus on
showing the bound on the projection Markov chain $\bar Q$. We will use the canonical path method.

Theorem 8.8. The Markov chain $\bar Q$ on $K_{RGB}$ is polynomially mixing.


Proof. For $\sigma, \tau \in K_{RGB}$, define the canonical path as follows. Let $\sigma = (t_1,b_1,g_1)$ and
$\tau = (t_2,b_2,g_2)$. Assume that $t_1 \ge t_2$. If not, the path from $\sigma$ to $\tau$ consists of the same vertices as
the path from $\tau$ to $\sigma$ but with all edges directed oppositely.

We define fhe canonical palh for ti odd and ^ 2 ^ the olher case only needs a minor lechnical 
modificalion due lo parity issues. Assume (wilhoul loss of generality by fhe symmelry of fhe col¬ 
ors blue and green) b\ < g\ and b 2 > g 2 - The palh is defined lo be {ti,b\ ,gi),{ti,bi + l,gi — 
If (f, (Wi+i\ (f, «-n + l n22h±l\ (t. —"X rG2ll+l\ (t.. tLdlf (t.. 
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1)^2 + 1), (^ 2 )^ 2 ,^ 2 )- It can be shown that along the path, the values of the distribution are uni- 
modal, i.e.. 

Lemma 8.9. For each $\sigma = (t_1,b_1,g_1)$, $\tau = (t_2,b_2,g_2) \in K_{RGB}$, the distribution $\tilde\pi$ attains a unique
maximum on the canonical path from $\sigma$ to $\tau$.

We defer the proof till the end of this argument. Assuming the lemma, the congestion of the
paths can be bounded as follows.



A 


Since along every canonical path the distribution is unimodal, the length of any path is at
most linear in $n$, and there are at most polynomially many paths $\Gamma(\alpha,\beta)$ using the edge $(\alpha,\beta)$, $A$
is at most a polynomial in $n$. □


Corollary 8.10. The Markov chain P on Krqb is polynomially mixing. 

Proof of Lemma 8.9. Let $\ell_t$ denote the set of $\sigma \in K_{RGB}$ such that $\sigma_1 = t$. Let $\ell_{GB}$ denote the
set consisting of configurations where the number of green and blue vertices are equal. Since the
space is discrete, because of parity considerations, the canonical paths cannot simply go along
the line $\ell_{t_1}$, then along the line $\ell_{GB}$ and finally along $\ell_{t_2}$, except in the case that $t_1$ and $t_2$ are both
even. For this case, it is sufficient to show that firstly, for all $1/3 < t < 1$, along the lines $\ell_t$
the maximum is at the intersection with $\ell_{GB}$ and secondly, along the line $\ell_{GB}$, the distribution is
unimodal. The observation is that the second fact implies that on the portion of the canonical
path along $\ell_{GB}$, the distribution is either

i) non-increasing

ii) non-decreasing

iii) non-decreasing and then non-increasing

but not decreasing and then increasing. Then in any of the three cases above, it can be verified
that there is a unique local maximum along the path.

In the other cases, when either both $t_1$ and $t_2$ are odd, or one is odd and the other even, the
canonical path makes a “diagonal” move to switch parity and we have to argue that the property
of being unimodal is not violated. It turns out that this is implied by the unimodality of the
continuous function $\tilde\pi$ on the lines $\ell_t$ and $\ell_{GB}$. We first show that along the lines $\ell_{t_1}$ and $\ell_{t_2}$ the
distribution $\tilde\pi$ is unimodal.

Claim 8.11. Let P* = ^ where p > 41og(2) and let £? = {a E LIrgb '■ tJi = t}. Then there 
exists a constant no, such that Vn > no the function 7t(G) when restricted to Lt is maximized at 
<^ 2=^3 = cind is non-increasing as G 3 decreases, for all X,nin < f < 1 - 

The proof follows by the same calculations made in the proof of Lemma 8.3.

Claim 8.12. Let p* = ^ where p > 41og(2). For n sufficiently large, 7tp*(fl;^) has a unique 
maximum X^ax in {l/'i,Xmin\ nnd is non-increasing on either side of it. 
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Proof. We examine the continuous extension 7t of the original distribution %. 



Neglecting factors not explicitly dependent on X, asymptotically, we obtain the function 


(3X^—2X+ 1 )—Xf! ln(X) — ( 1 —X)n ln( t—X ^ 


e 


The claim can be verified by differentiating it, solving for the stationary point $\lambda_{\max}$, and
checking the second derivative. By construction, $\tilde\pi_{\beta^*}$ is non-increasing on either side of $\lambda_{\max}$ for
$1/3 \le \lambda \le 1$.


□ 
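
The shape of the profile along $\ell_{GB}$ (a disordered mode at $\lambda = 1/3$, a minimum at $\lambda_{\min}$, and an ordered mode at $\lambda_{\max}$) can be explored numerically. The sketch below is an illustration under assumptions: the per-vertex exponent is taken to be $\frac{\beta}{2}(3\lambda^2 - 2\lambda + 1) - \lambda\ln\lambda - (1-\lambda)\ln\frac{1-\lambda}{2}$, where the energy prefactor is a hypothetical normalization, and $\beta = 1.42$ is a hypothetical value slightly above the coexistence point of that normalization.

```python
import math

def phi(lam, beta):
    """Assumed per-vertex exponent of the weight of (lam, (1-lam)/2, (1-lam)/2)."""
    energy = 0.5 * beta * (3 * lam * lam - 2 * lam + 1)
    entropy = -lam * math.log(lam) - (1 - lam) * math.log((1 - lam) / 2)
    return energy + entropy

def profile_extrema(beta, steps=20000):
    """Grid search on [1/3, 0.99]: returns (lam_min, lam_max), where lam_max is the
    global maximizer and lam_min minimizes phi between 1/3 and lam_max."""
    grid = [1 / 3 + k * (0.99 - 1 / 3) / steps for k in range(steps + 1)]
    vals = [phi(lam, beta) for lam in grid]
    i_max = max(range(len(grid)), key=vals.__getitem__)
    i_min = min(range(i_max + 1), key=vals.__getitem__)
    return grid[i_min], grid[i_max]

if __name__ == "__main__":
    beta = 1.42  # hypothetical value
    lam_min, lam_max = profile_extrema(beta)
    print(lam_min, lam_max, phi(1 / 3, beta), phi(lam_min, beta), phi(lam_max, beta))
```

With these assumed values the exponent at the ordered mode exceeds the exponent at $\lambda = 1/3$, which in turn exceeds the exponent at $\lambda_{\min}$, so the cut at $\lambda_{\min}$ separates two modes of exponentially different weight.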


Finally, along the “diagonal” portions of the path the change in the value of the distribution 
will be the net change if we were to move in a continuous fashion horizontally and then vertically. 
Since along both these segments the change will be of the same sign if the segments on either 
end are of the same type (increasing or decreasing), by the two claims above, the net change will 
be positive or negative as required by unimodality. 

The Metropolis chain at P* mixes torpidly, and by the above lemmas we can bound the mixing 
time. Note that the proof uses a stronger version of the Comparison Theorem. 

To use the comparison theorem to infer a bound on the mixing time of $P$ from that of $\tilde P$, we
need good bounds on the parameters $A$ and $a$. It turns out that $A$ is the insignificant factor in the
mixing time; rather, $a$ determines the mixing time of $P$. In contrast, most previous applications
of the comparison theorem consider chains with identical stationary distributions, so typically
the parameter $a = 1$.

Proof of Proposition 8.6. We will use the refined comparison theorem of Diaconis and
Saloff-Coste, Theorem 5.2. Note that the two Markov kernels are identical, but their stationary
distributions are very different near the disordered state. Since the kernels are identical, we can simply
define trivial canonical paths, i.e., when we decompose a step in the unknown chain $Q$ with
stationary distribution $\pi_{\beta^*}$ into a path using steps from the known chain $\tilde Q$ with distribution $\tilde\pi$,
these paths all have length 1. It can be verified that the Metropolis transition probabilities on the
two chains are always within a polynomial factor of each other and $\max_x(\tilde\pi(x)/\pi(x))$ is at most
a polynomial, since flattening the distribution has a negligible effect on the partition function.

Claim 8.13. Let P* = ^ where q > 41og(2). Then, 




Proof The upper bound is easy fo see by fhe definifion of Z. By fhe consfrucfion of fhe flaffened 
disfribufion, Z < Zrgb{^*)- For fhe lower bound, we have 


Z 
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The last inequality follows because for p* > pc, the stationary probability on Kj is at least 1 
of the total measure. □ 

Hence the parameter $A$ is bounded by a polynomial. Finally, we can compare the largest
variation in the distributions $\pi$ and $\tilde\pi$ to bound $a$. Letting $y$ be any configuration in $\Omega_{1/3}$ and $y^*$ any
configuration in $\Omega_{\lambda_{\min}}$, we have

_ (x) _ ttp* {Q.x^^.,)ZrgbI2 ^ 1 

7tp,(x) “ 7tp.(fll/3) “ 7tp*(ai/3) nO(l) 

Plugging these bounds into the comparison theorem (Theorem 5.2) then implies Proposition 8.6.